Closed JoshCheek closed 9 years ago
This is an encoding issue. The inspected string is sometimes coming back as #<Encoding:UTF-8>, and sometimes as #<Encoding:US-ASCII>. When I then go to add the inspections as annotations, it blows up inside parser. Looks like I can call force_encoding on the inspected value, here, to get it to not blow up. For some reason, though, this causes the stack overflow test to blow up. Not sure why, but it would be good for these failures to show up at lib level tests. Also, I might be able to force the encoding on the consumer side instead of the producer side.
Rough time figuring this out. It boils down to this: a UTF-8 string whose byte representation includes bytes outside the ASCII range gets incorrectly tagged as ASCII-8BIT (a byte string), and then something tries to transcode it to UTF-8. Because those bytes are not valid ASCII, it blows up.
# encoding: utf-8
"ç".force_encoding(Encoding::ASCII_8BIT) # => "\xC3\xA7"
.encode(Encoding::UTF_8) # ~> Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
How does that force_encoding happen? Well, the JSON stdlib's error message includes the string it was attempting to parse, but the error message itself is in ASCII-8BIT (I don't know why, I'll probably open a bug report). In other words, it thinks the error message is a byte string (notice how it inspects).
require 'json' # => true
str = "√" # => "√"
msg = (JSON.parse JSON.dump str rescue $!.message) # => "757: unexpected token at '\"\xE2\x88\x9A\"'"
str.encoding # => #<Encoding:UTF-8>
msg.encoding # => #<Encoding:ASCII-8BIT>
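A minimal sketch of dealing with this on the consumer side, assuming the bytes in the message really are UTF-8 and only the encoding tag is wrong. The helper name json_error_message is hypothetical, not part of the JSON lib. It retags the message instead of transcoding it:

```ruby
require 'json'

# Hypothetical helper: returns nil when the text parses, otherwise the
# parser's error message, retagged as UTF-8 if it arrived as ASCII-8BIT.
def json_error_message(json_text)
  JSON.parse(json_text)
  nil
rescue JSON::ParserError => error
  message = error.message.dup
  # Assumption: the bytes are already UTF-8, only the encoding tag is wrong.
  message.force_encoding(Encoding::UTF_8) if message.encoding == Encoding::ASCII_8BIT
  message
end
```

The exact message text and its encoding depend on the json version, so this only guarantees that the result is no longer tagged as a byte string.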
Okay, so how does it try to transcode itself? Apparently, when two strings need to be concatenated (as we are doing when we append the errors into comments in the original text), it will try transcoding one to the other. It first tries to make the RHS into the LHS; if that fails, it tries to make the LHS into the RHS. If that fails too, it blows up as we see in this issue.
# encoding: UTF-8
def a8b(string)
string.force_encoding(Encoding::ASCII_8BIT)
end
# ENCODINGS MATCH: no transcoding
("a" + "å") # => "aå"
("a" + "å").encoding # => #<Encoding:UTF-8>
# ENCODINGS ARE COMPATIBLE: the RHS is transcoded to the LHS
("a" + a8b("a")) # => "aa"
("a" + a8b("a")).encoding # => #<Encoding:UTF-8>
(a8b("a") + "a") # => "aa"
(a8b("a") + "a").encoding # => #<Encoding:ASCII-8BIT>
# RHS IS NOT COMPATIBLE WITH LHS: the LHS is transcoded to the RHS
# example1: not compatible b/c å is multibyte
(a8b("a") + "å") # => "aå"
(a8b("a") + "å").encoding # => #<Encoding:UTF-8>
# example2: not compatible b/c "\xC3\xA5" is a string of two bytes with values 195 and 165
# since ASCII only has values 0-127, these are not valid ASCII values.
# So the RHS has no idea what these are supposed to be and can't change encodings.
# So the LHS changes its encoding to ASCII-8BIT, because it has a valid ascii representation (97)
("a" + a8b("å")) # => "a\xC3\xA5"
("a" + a8b("å")).encoding # => #<Encoding:ASCII-8BIT>
# NEITHER RHS NOR LHS ARE COMPATIBLE: explosions
# this is where we find ourselves
("å" + a8b("å")) # ~> Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
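The cases above can be checked before concatenating. A small sketch using Encoding.compatible?, which returns the encoding the result would have, or nil when the two strings can't be combined. The helper name concat_safe? is made up for illustration:

```ruby
# Hypothetical helper: true when `lhs + rhs` would not raise
# Encoding::CompatibilityError.
def concat_safe?(lhs, rhs)
  !Encoding.compatible?(lhs, rhs).nil?
end

utf8   = "å"
binary = "å".b # same bytes, tagged ASCII-8BIT

concat_safe?("a", binary)  # ASCII-only LHS, so the two can be combined
concat_safe?(utf8, binary) # both sides have non-ASCII bytes in different encodings
```

This only tells you whether the concatenation would raise. It doesn't fix the mismatched string.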
The JSON lib will emit an invalid object while dumping (apparently, toplevel JSON value can only be an object or an array ...for some reason). However, rather than blowing up when asked to dump invalid JSON, it blows up when asked to parse it. Well... unless you tell it that the lib which generated this JSON is ...quirky.
require 'json' # => false
json = JSON.dump("√") # => "\"√\""
JSON.parse(json) rescue $!.message # => "757: unexpected token at '\"\xE2\x88\x9A\"'"
JSON.parse(json, quirks_mode: true) # => "√"
So we dump "√" to JSON and get the invalid JSON "\"√\"" back. Parsing that fails with unexpected token at '\"√\"', and because the message is ASCII-8BIT, it inspects as unexpected token at '\"\xE2\x88\x9A\"'.
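One way to sidestep the toplevel restriction without relying on quirks_mode is to wrap the value in an array before dumping and unwrap it after parsing. This is only a sketch, assuming both sides of the pipe agree on the wrapping. The helper names dump_wrapped and parse_wrapped are hypothetical:

```ruby
require 'json'

# Hypothetical helpers: wrap the value in an array so the toplevel JSON
# value is always an array, then unwrap it after parsing.
def dump_wrapped(value)
  JSON.dump([value])
end

def parse_wrapped(json_text)
  JSON.parse(json_text).first
end
```

The trade-off is that whatever consumes the JSON needs to know about the extra array.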
Any string that exists at this level is SiB data and should be encoded accordingly. I think that in addition to calling to_s on it, we should try first transcoding it, and then force encoding it to whatever the other side's encoding is (presumably UTF8, but I'm not sure that's legit)
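A rough sketch of that idea, assuming the target encoding is UTF-8. The helper name normalize_to_utf8 is hypothetical, and replacing leftover invalid bytes with "?" is my own choice here, not something SiB does:

```ruby
# Hypothetical helper: call to_s, try transcoding to UTF-8, and fall back
# to retagging the bytes as UTF-8 when transcoding fails.
def normalize_to_utf8(value)
  str = value.to_s.dup
  begin
    str = str.encode(Encoding::UTF_8) unless str.encoding == Encoding::UTF_8
  rescue EncodingError
    # Transcoding failed (e.g. non-ASCII bytes tagged ASCII-8BIT),
    # so assume the bytes are already UTF-8 and just retag them.
    str.force_encoding(Encoding::UTF_8)
  end
  # If the bytes still aren't valid UTF-8, replace the bad ones with "?".
  str.valid_encoding? ? str : str.scrub("?")
end
```

With this, a byte string holding the UTF-8 bytes of "√" comes back as a proper UTF-8 "√", so appending it to UTF-8 text no longer raises.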
Thanks for posting this detailed report Josh. You saved me a bunch of time trying to figure this out myself for a project I'm working on.
lol, np. When it takes me a lot of effort to figure something out, I try to document it in that moment, so I don't have to re-experience it later ^_^ Glad I was able to save someone else this confusion!
I just came across this issue myself. Is there any solution that will allow me to parse UTF-8 strings using the ruby json parser?
I think the solution is here: https://github.com/ohler55/oj
It does parse UTF-8 strings. The problem here was that the string wasn't valid UTF-8, so it tried to fix the encoding and couldn't.
I could not get it to successfully parse JSON.parse(JSON.dump("√")). Maybe I don't fully understand the issue, but I ended up converting to the Oj gem.
It does seem to handle anything I have been able to throw at it thus far, but it behaves a bit differently, so you need to set the mode with Oj.default_options = {:mode => :compat } to make it work similarly to the normal JSON. A plus is that it is apparently faster as well.
That's a different issue: it's blowing up because "√" is not valid as a toplevel JSON value.
JSON only considers arrays and objects (ruby hashes) to be valid toplevel objects. The spec does a bad job of conveying this, but at the top of http://json.org/ it says:
JSON is built on two structures:
- A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
- An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
You can see it in Ruby's parsing code here.
The default JSON parser can deal with this, it just doesn't by default, because that's apparently nonstandard. To get it to parse this, pass quirks_mode: true when you parse it.
$ ruby -rjson -e 'p JSON.parse(JSON.dump("√"), quirks_mode: true)'
"√"
Interesting. Strange that basic values are invalid JSON. It does make sense though: "JavaScript Object Notation". If I were designing this, though, I would have added those as valid syntax. "some string" feels so wrong to be invalid. The same goes for 123 and true. Why are arrays valid then? Shouldn't those be values inside an object similar to strings? Weird.
Anyways, thanks a lot for clarifying!
Versions:
Input program
Wrapped program
Stdout from execution
Is empty.
Stderr from execution
Exit status
1