“Gibberish” false positive when parsing Unicode identifiers in objects

lynn commented 4 years ago

This sanity check keeps demjson from correctly parsing {あ:2}. When I simply disable it, the parse is successful. Unicode identifiers are valid ECMAScript, and most definitely not gibberish, so demjson should probably not throw when the data starts with {あ.

Seems better to skip this sanity check and let the parse fail in another way if the data really is invalid.

daira commented 4 years ago

Not to mention that "gibberish" is never an appropriate way to describe text just because it is written in a non-Latin script.

gwenya commented 4 years ago

That sanity check does indeed seem rather useless (any encoding errors will hopefully be caught at a later point anyway) and wrong in at least two ways: it appears to rely on the RFC's statement that "the first two characters of a JSON text will always be ASCII characters" but ignores that A) when the RFC refers to "JSON text", it means a JSON object or array, not a simple value like a string, and B) non-strict mode allows arbitrary Unicode in unquoted identifiers.

The point of that statement in the RFC is not to allow more-or-less meaningless sanity checks but to show a way to detect the encoding (UTF-8/16/32) based on only the first two characters, and only when the input is a JSON object or array.
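To illustrate the detection the RFC has in mind, here is a minimal Python sketch of the null-byte heuristic from RFC 4627, section 3. The function name `detect_json_encoding` and the UTF-8 fallback for inputs shorter than four octets are my own assumptions for illustration; none of this is taken from demjson.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding from the null-byte pattern of the first four octets.

    Follows the table in RFC 4627, section 3. Hypothetical helper, not demjson code.
    """
    head = data[:4]
    if len(head) < 4:
        # Assumption: too short for the heuristic, so fall back to UTF-8.
        return "utf-8"
    # True where the octet is a null byte.
    nulls = tuple(b == 0 for b in head)
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else is treated as UTF-8 (xx xx xx xx).
    return patterns.get(nulls, "utf-8")
```

With `{あ:2}` encoded as UTF-16LE, the first four octets are 7B 00 42 30, which matches none of the patterns, so the heuristic falls back to UTF-8. That is why the trick only works when the first two characters are ASCII.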

I think the sanity check would do the correct thing if it was only used in strict mode, however there is a good reason to remove it there too: it completely fails its stated goal of "rais[ing] a suitably descriptive error rather than an obscure syntax error later on", since the error message points to improper Unicode encoding while the actual problem is just as likely to be a plain syntax error (e.g. trying to decode something like {あ:2} in strict mode).
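To make the mismatch concrete, here is a hypothetical Python sketch of a check that takes the RFC statement literally. It is not demjson's actual code; the name `first_two_chars_are_ascii` is made up, and it deliberately ignores any special-casing the real check may have.

```python
def first_two_chars_are_ascii(text: str) -> bool:
    """Hypothetical check: are the first two characters of the text ASCII?

    Mirrors the RFC statement literally; NOT the check demjson implements.
    """
    return all(ord(ch) < 128 for ch in text[:2])


# Correctly decoded input with a non-ASCII unquoted key is flagged,
# even though nothing is wrong with its encoding.
flagged = not first_two_chars_are_ascii("{あ:2}")
```

A check like this cannot tell "the bytes were decoded with the wrong encoding" apart from "the text is fine but uses a non-ASCII identifier", so any encoding-related error message it produces is a guess.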

Kakurady commented 4 years ago

> I think the sanity check would do the correct thing if it was only used in strict mode, however there is a good reason to remove it there too

Indeed, ECMA-404 (since the 1st edition) and RFC 8259 both allow any value to appear at the top level of a JSON document, not just Arrays and Objects as specified in RFC 4627. This makes "あ" valid JSON text, yet it fails the sanity check.
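As a quick cross-check, Python's standard `json` module (used here only as a reference strict parser, not demjson) accepts a bare string at the top level and rejects the unquoted identifier with an ordinary syntax error. The helper name `is_strict_json` is just for illustration.

```python
import json


def is_strict_json(text: str) -> bool:
    """Return True if the standard-library parser accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        # Raised for syntax errors such as an unquoted object key.
        return False
    return True


top_level_string_ok = is_strict_json('"あ"')    # bare string at top level
unquoted_key_ok = is_strict_json("{あ:2}")      # unquoted identifier as key
```

Here `top_level_string_ok` is true and `unquoted_key_ok` is false, and the failure in the second case is reported as a syntax problem rather than an encoding one.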

gwenya commented 4 years ago

Oh yeah, you are right. It took me a bit to look up the spec and write that up, and by the time I wrote that last part I had already forgotten that the sanity check is also used on values like strings.

EDIT: Although I believe the check does allow anything if the first character is a quote, so it should not fail. I might have misread that part of the code, though.

dmeranda / demjson

“Gibberish” false positive when parsing Unicode identifiers in objects #36