francoispqt / gojay

high performance JSON encoder/decoder with stream API for Golang
MIT License

string unmarshal does not recognize escape sequences #41

Closed maj-o closed 6 years ago

maj-o commented 6 years ago

Test:

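package example // hypothetical package name so the file compiles

import (
	"testing"

	"github.com/francoispqt/gojay"
)

func TestGC(t *testing.T) {
	a := "d:\\test \\r go to next line" // same value as the raw string `d:\test \r go to next line`
	var b string

	t.Log(a) // >> d:\test \r go to next line

	aJSON, err := gojay.Marshal(a)
	if err != nil {
		t.Fatal(err)
	}
	t.Log(string(aJSON)) // ok >> "d:\\test \\r go to next line"

	if err := gojay.Unmarshal(aJSON, &b); err != nil {
		t.Fatal(err)
	}
	// ERR: \t and \r are decoded as a tab and a carriage return
	t.Log(b) // ERR >> d:<TAB>est <CR> go to next line
}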

Please also check whether "this is a \t tab this is a path d:\\test" works. There seems to be an error in parseEscapedString. I could not really find it, but dec.cursor is increased twice; maybe this helps. If you remove this line, simple escape sequences are handled correctly:

start := dec.cursor
for ; dec.cursor < dec.length || dec.read(); dec.cursor++ {
    if dec.data[dec.cursor] != '\\' {
        d := dec.data[dec.cursor]
        // >>>> dec.cursor = dec.cursor + 1            <<<< this may be too much
        nSlash := dec.cursor - start

francoispqt commented 6 years ago

Hi,

This is actually the expected behaviour. It's part of the JSON RFC. The Go standard library package encoding/json behaves exactly the same way; check it in this Go playground example: https://play.golang.org/p/UmGBliOAgBc

I'm closing.

maj-o commented 6 years ago

RFC 8259, section 7:

"...So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\"."

When I marshal a string, every "\" is marshalled to "\\". I think it is legitimate to expect that unmarshalling the result gives back the same string that was used as input.

I know that the standard library has errors.

My problem is a real-world problem: Windows uses \ for paths, so please think about a usable solution. Either marshal to \u005C or unmarshal correctly.
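
The round-trip expectation described above can be checked against the standard library with a small sketch. It uses only encoding/json, not gojay; roundTrip and decode are hypothetical helper names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip encodes s as a JSON string and decodes the result again.
// It returns the JSON text and the decoded value.
func roundTrip(s string) (string, string, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return "", "", err
	}
	var decoded string
	if err := json.Unmarshal(raw, &decoded); err != nil {
		return "", "", err
	}
	return string(raw), decoded, nil
}

// decode parses a JSON string literal (including its quotes) into a Go string.
func decode(jsonText string) (string, error) {
	var out string
	err := json.Unmarshal([]byte(jsonText), &out)
	return out, err
}

func main() {
	in := `d:\test \r go to next line` // raw string: real backslashes, no control characters

	enc, dec, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(enc)       // the JSON text, with every backslash doubled
	fmt.Println(dec == in) // true: the decoded value equals the input

	// Both escape forms from RFC 8259 section 7 decode to one reverse solidus.
	a, _ := decode(`"\u005C"`)
	b, _ := decode(`"\\"`)
	fmt.Println(a == b, len(a)) // true 1
}
```

With this input the encoded text contains "\\r", and decoding it yields a backslash followed by the letter r, not a carriage return.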

francoispqt commented 6 years ago

Sorry for the delay. From the RFC:

char = unescaped /
          escape (
              %x22 /          ; "    quotation mark  U+0022
              %x5C /          ; \    reverse solidus U+005C
              %x2F /          ; /    solidus         U+002F
              %x62 /          ; b    backspace       U+0008
              %x66 /          ; f    form feed       U+000C
              %x6E /          ; n    line feed       U+000A
              %x72 /          ; r    carriage return U+000D
              %x74 /          ; t    tab             U+0009
              %x75 4HEXDIG )  ; uXXXX                U+XXXX

This basically means that if you want to represent a line feed (for example) in JSON, you must use the notation "\n". Conversely, when decoding, if you find this notation, the decoded string should contain a line feed character.
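
The grammar above can be read as a single left-to-right pass in which each reverse solidus consumes exactly the character (or uXXXX group) that follows it. The sketch below illustrates that reading; unescapeJSON and simpleEscapes are hypothetical names, not part of gojay or encoding/json, and the sketch leaves out surrogate pairs and validation of unescaped control characters.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// simpleEscapes maps the character after a reverse solidus to the decoded byte.
var simpleEscapes = map[byte]byte{
	'"': '"', '\\': '\\', '/': '/',
	'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t',
}

// unescapeJSON (hypothetical helper) decodes the body of a JSON string,
// given without its surrounding quotes, in one left-to-right pass.
func unescapeJSON(body string) (string, error) {
	var sb strings.Builder
	for i := 0; i < len(body); i++ {
		c := body[i]
		if c != '\\' {
			sb.WriteByte(c) // ordinary byte: copy as-is
			continue
		}
		i++ // step onto the character that follows the reverse solidus
		if i >= len(body) {
			return "", errors.New("dangling escape at end of input")
		}
		if body[i] == 'u' {
			if i+4 >= len(body) {
				return "", errors.New("truncated unicode escape")
			}
			n, err := strconv.ParseUint(body[i+1:i+5], 16, 16)
			if err != nil {
				return "", fmt.Errorf("invalid unicode escape %q", body[i+1:i+5])
			}
			sb.WriteRune(rune(n)) // simplification: no surrogate-pair handling
			i += 4                // skip the four hex digits
			continue
		}
		decoded, ok := simpleEscapes[body[i]]
		if !ok {
			return "", fmt.Errorf("invalid escape character %q", body[i])
		}
		sb.WriteByte(decoded)
	}
	return sb.String(), nil
}

func main() {
	// `\\r` is an escaped backslash followed by a plain r.
	out, err := unescapeJSON(`d:\\test \\r go to next line`)
	fmt.Println(out == `d:\test \r go to next line`, err)

	// `\t` on its own is the tab escape.
	out, err = unescapeJSON(`a\tb`)
	fmt.Println(out == "a\tb", err)
}
```

Because the pass never re-reads its own output, the backslash produced by "\\" is not treated as the start of another escape.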

To confirm my idea, I performed JSON encoding/decoding in multiple dynamic languages with builtin parsers (JS, Python, Ruby) and they all behave this way.

maj-o commented 6 years ago

Thank you for answering even though the issue is closed.

But that is exactly my point. Encoding works. Decoding any encoded string or stream containing \\ or \u005c is wrong. If this is followed by r, t, ... it is a disaster. Although there have been some problems with structure size and thread safety for the past month, I like gojay. So I replace any known \\ with // and convert back on the other side. And yes, I know that the std lib is also buggy. I don't believe I am writing this: Microsoft std since Framework 4.52 is working as expected. Anyway, I'll go with gojay.
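
The workaround mentioned in the comment above might look roughly like the following sketch. It assumes the intent is to swap each backslash for a forward slash before encoding and to swap back after decoding; hideBackslashes and restoreBackslashes are hypothetical names. The swap is only reversible when the original string contains no forward slashes of its own.

```go
package main

import (
	"fmt"
	"strings"
)

// hideBackslashes replaces every backslash with a forward slash so that
// the value contains nothing a JSON decoder could treat as an escape.
func hideBackslashes(path string) string {
	return strings.ReplaceAll(path, `\`, "/")
}

// restoreBackslashes reverses hideBackslashes. Caveat: any forward slash
// that was already in the original value is turned into a backslash too.
func restoreBackslashes(path string) string {
	return strings.ReplaceAll(path, "/", `\`)
}

func main() {
	original := `d:\test\readme.txt`

	wire := hideBackslashes(original) // value handed to the encoder
	fmt.Println(wire)                 // d:/test/readme.txt

	back := restoreBackslashes(wire) // applied after decoding on the other side
	fmt.Println(back == original)    // true
}
```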

chuigda commented 1 year ago

"I don't believe I am writing this: Microsoft std since Framework 4.52 is working as expected."

It's 2023 now and every JSON library of every programming language, except for gojay, is working as expected. Hard to believe it took me a whole afternoon debugging this.