JoshCheek commented 10 years ago

Versions:

SeeingIsBelieving::VERSION  "2.1.4"
Parser::VERSION             "2.2.0.pre.4"
RUBY_VERSION                "2.1.1"
ENV['RUBY_VERSION']         "2.1.1"

Input program

require 'json'  # => false

initial = "”"  # => "”"

result =
  JSON.dump(  # => JSON
    initial   # => "”"
  )           # => "\"”\""

JSON.parse result

Wrapped program

begin; $SiB.number_of_captures = Float::INFINITY; $SiB.record_result(1, (require 'json'))

$SiB.record_result(3, (initial = "”"))

$SiB.record_result(8, (result =
  $SiB.record_result(6, (JSON)).dump(
    $SiB.record_result(7, (initial))
  )))

$SiB.record_result(10, (JSON.parse result));rescue Exception;lambda {line_number = $!.backtrace.grep(/#{__FILE__}/).first[/:\d+/][1..-1].to_i;$SiB.record_exception line_number, $!;$SiB.exitstatus = 1;$SiB.exitstatus = $!.status if $!.kind_of? SystemExit;}.call;end

Stdout from execution

Is empty.

Stderr from execution

/Users/josh/.rubies/ruby-2.1.1/lib/ruby/2.1.0/json/common.rb:223:in `encode': "\xE2" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
    from /Users/josh/.rubies/ruby-2.1.1/lib/ruby/2.1.0/json/common.rb:223:in `generate'
    from /Users/josh/.rubies/ruby-2.1.1/lib/ruby/2.1.0/json/common.rb:223:in `generate'
    from /Users/josh/.rubies/ruby-2.1.1/lib/ruby/2.1.0/json/common.rb:394:in `dump'
    from /Users/josh/.gem/ruby/2.1.1/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/the_matrix.rb:39:in `block in <top (required)>'

Exit status

1

JoshCheek commented 9 years ago

This is an encoding issue. The inspected string sometimes comes back as #<Encoding:UTF-8> and sometimes as #<Encoding:US-ASCII>. When I then go to add the inspections as annotations, it blows up inside parser. It looks like I can call force_encoding on the inspected value here to keep it from blowing up. For some reason, though, this causes the stack overflow test to blow up. Not sure why, but it would be good for these failures to show up in lib-level tests. Also, I might be able to force the encoding on the consumer side instead of the producer side.
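A minimal sketch of why inspected values can arrive with different encodings, and of the force_encoding idea mentioned above. The helper name inspect_as_utf8 is hypothetical (it is not part of SiB), and it assumes the inspected bytes are already valid UTF-8 or plain ASCII.

```ruby
# Integer#inspect returns a US-ASCII string.
number_inspect = 1.inspect
number_inspect.encoding # => #<Encoding:US-ASCII>

# String#inspect uses the default internal/external encoding, so the
# encoding of this result depends on the environment it runs in.
string_inspect = "\u221A".inspect # "\u221A" is "√"

# Hypothetical helper: relabel whatever #inspect returned as UTF-8.
# Assumption: the bytes are ASCII or already valid UTF-8.
def inspect_as_utf8(object)
  object.inspect.dup.force_encoding(Encoding::UTF_8)
end

inspect_as_utf8(1).encoding        # => #<Encoding:UTF-8>
inspect_as_utf8("\u221A").encoding # => #<Encoding:UTF-8>
```

Relabeling on the producer side like this means every annotation has one known encoding by the time it reaches the rewriter.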

JoshCheek commented 9 years ago

Synopsis

Rough time figuring this out. It boils down to this: a UTF-8 string whose byte representation includes bytes outside the ASCII range gets incorrectly tagged as ASCII-8BIT (a byte string), and then something tries to transcode it to UTF-8. Because those bytes are not valid ASCII, it blows up.

# encoding: utf-8
"ç".force_encoding(Encoding::ASCII_8BIT)  # => "\xC3\xA7"
   .encode(Encoding::UTF_8)               # ~> Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
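To make the distinction explicit, here is a small sketch contrasting encode (which converts bytes) with force_encoding (which only changes the label). The variable names are illustrative, and "\u00E7" is used instead of a literal "ç" so the snippet does not depend on the source file's encoding.

```ruby
original   = "\u00E7"   # "ç" as UTF-8, bytes C3 A7
mislabeled = original.b # same bytes, tagged ASCII-8BIT

# The bytes are identical; only the encoding label differs.
mislabeled.bytes == original.bytes # => true

# Transcoding fails: there is no mapping from byte 0xC3 in ASCII-8BIT to UTF-8.
conversion_error = nil
begin
  mislabeled.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => error
  conversion_error = error
end

# Relabeling works, because the bytes were valid UTF-8 all along.
relabeled = mislabeled.dup.force_encoding(Encoding::UTF_8)
relabeled == original # => true
```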

How it got encoded incorrectly

How does that force_encoding happen? Well, the JSON stdlib's error message includes the string it was attempting to parse, but the error message itself is in ASCII-8BIT (I don't know why, I'll probably open a bug report). In other words, it thinks the error message is a byte string (notice how it inspects).

require 'json'  # => true

str = "√"                                           # => "√"
msg = (JSON.parse JSON.dump str rescue $!.message)  # => "757: unexpected token at '\"\xE2\x88\x9A\"'"

str.encoding  # => #<Encoding:UTF-8>
msg.encoding  # => #<Encoding:ASCII-8BIT>

How strings try to transcode to fix the issue

Okay, so how does it try to transcode itself? Apparently, when two strings need to be concatenated (as we are doing when we append the errors into comments in the original text), it will try transcoding one to the other. First it tries to make the RHS into the LHS; if that fails, it tries to make the LHS into the RHS. If that also fails, it blows up, as we see in this issue.

# encoding: UTF-8

def a8b(string)
  string.force_encoding(Encoding::ASCII_8BIT)
end

# ENCODINGS MATCH: no transcoding
  ("a" + "å")           # => "aå"
  ("a" + "å").encoding  # => #<Encoding:UTF-8>

# ENCODINGS ARE COMPATIBLE: the RHS is transcoded to the LHS
  ("a" + a8b("a"))           # => "aa"
  ("a" + a8b("a")).encoding  # => #<Encoding:UTF-8>

  (a8b("a") + "a")           # => "aa"
  (a8b("a") + "a").encoding  # => #<Encoding:ASCII-8BIT>

# RHS IS NOT COMPATIBLE WITH LHS: the LHS is transcoded to the RHS
  # example1: not compatible b/c å is multibyte
  (a8b("a") + "å")           # => "aå"
  (a8b("a") + "å").encoding  # => #<Encoding:UTF-8>

  # example2: not compatible b/c "\xC3\xA5" is a string of two bytes with values 195 and 165
  # since ASCII only has values 0-127, these are not valid ASCII values.
  # So the RHS has no idea what these are supposed to be and can't change encodings.
  # So the LHS changes its encoding to ASCII-8BIT, because it has a valid ascii representation (97)
  ("a" + a8b("å"))           # => "a\xC3\xA5"
  ("a" + a8b("å")).encoding  # => #<Encoding:ASCII-8BIT>

# NEITHER RHS NOR LHS ARE COMPATIBLE: explosions
  # this is where we find ourselves
  ("å" + a8b("å"))  # ~> Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
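The same cases can be checked without triggering the exception by asking Ruby up front, via Encoding.compatible?. The safe_concat helper below is a hypothetical sketch, not SiB code; it assumes that an incompatible RHS really holds bytes in the LHS's encoding and was merely mislabeled.

```ruby
utf8_multibyte   = "\u00E5"   # "å" in UTF-8
binary_ascii     = "a".b      # ASCII-only, tagged ASCII-8BIT
binary_multibyte = "\u00E5".b # bytes C3 A5, tagged ASCII-8BIT

# Returns the encoding the concatenation would have, or nil if it would raise.
Encoding.compatible?(utf8_multibyte, binary_ascii)     # => #<Encoding:UTF-8>
Encoding.compatible?(binary_ascii, utf8_multibyte)     # => #<Encoding:UTF-8>
Encoding.compatible?(utf8_multibyte, binary_multibyte) # => nil

# Hypothetical helper: concatenate, relabeling the RHS only when Ruby
# reports the pair as incompatible. Invalid bytes are replaced by scrub.
def safe_concat(lhs, rhs)
  return lhs + rhs if Encoding.compatible?(lhs, rhs)
  lhs + rhs.dup.force_encoding(lhs.encoding).scrub
end

safe_concat(utf8_multibyte, binary_multibyte) # => "åå"
```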

Okay, but how did we get the error message?

The JSON lib will emit an invalid object while dumping (apparently, a toplevel JSON value can only be an object or an array ...for some reason). However, rather than blowing up when asked to dump invalid JSON, it blows up when asked to parse it. Well... unless you tell it that the lib which generated this JSON is ...quirky.

require 'json'  # => false

json = JSON.dump("√")                # => "\"√\""
JSON.parse(json) rescue $!.message   # => "757: unexpected token at '\"\xE2\x88\x9A\"'"
JSON.parse(json, quirks_mode: true)  # => "√"
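An alternative that sidesteps the toplevel restriction entirely is to wrap the value in an array before generating and unwrap it after parsing. This is only a sketch of a workaround, not something the issue itself proposes; it should behave the same whether or not the installed json version accepts bare scalars.

```ruby
require 'json'

str = "\u221A" # "√"

# An array is always a valid toplevel JSON value.
json = JSON.generate([str])

# Parse the array and take the single element back out.
round_tripped = JSON.parse(json).first

round_tripped == str   # => true
round_tripped.encoding # => #<Encoding:UTF-8>
```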

Summary

User code (it was mine) erroneously dumps "√" to JSON and gets the invalid JSON "\"√\"" back.
User code passes that to the parser, which blows up and raises an error with the message unexpected token at '\"√\"'
The JSON stdlib incorrectly encodes the error message as ASCII-8BIT, so it actually looks like unexpected token at '\"\xE2\x88\x9A\"'
SiB sees and records it
Then sends it back to the main process as a Base64-encoded Marshal dump.
The main process decodes it back into an incorrectly tagged ASCII-8BIT string and stores it on a result object
And then adds it as a comment into the original source
Which uses a rewriter to insert the invalid ASCII-8BIT error message as a comment into the multibyte UTF-8 source code.
Thus, down in the bowels of the rewriter, these two incompatible encodings finally collide and explode.
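The chain above can be approximated in a few lines. This is a simplified stand-in, not SiB's actual transport code: the message is tagged ASCII-8BIT by hand to simulate what the JSON error message looked like, and pack("m0") / unpack("m0") are used for the Base64 step.

```ruby
# Simulated error message: UTF-8 bytes tagged as ASCII-8BIT.
message = "unexpected token at '\"\u221A\"'".b

# Child process side: Marshal dump, then Base64 encode.
wire = [Marshal.dump(message)].pack("m0")

# Main process side: Base64 decode, then Marshal load.
restored = Marshal.load(wire.unpack("m0").first)
restored.encoding # => #<Encoding:ASCII-8BIT>

# Appending it as a comment to multibyte UTF-8 source is where it explodes.
source_line = "str = \"\u221A\""
collision = nil
begin
  source_line + "  # ~> " + restored
rescue Encoding::CompatibilityError => error
  collision = error
end
collision.class # => Encoding::CompatibilityError
```

The round trip is faithful: Marshal preserves the wrong encoding tag, so the bad label survives all the way to the rewriter.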

How to fix it

Any string that exists at this level is SiB data and should be encoded accordingly. I think that in addition to calling to_s on it, we should first try transcoding it, and then force encoding it to whatever the other side's encoding is (presumably UTF-8, but I'm not sure that's legit).
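A rough sketch of that "transcode first, then force" idea. The method name coerce_to_utf8 is hypothetical, and the target encoding is assumed to be UTF-8, which the paragraph above is itself unsure about; the final scrub is an extra safety step not mentioned in the original.

```ruby
# Hypothetical helper: make any value safe to splice into UTF-8 source.
def coerce_to_utf8(value)
  string = value.to_s.dup
  begin
    # First choice: a real transcode, which works for things like ISO-8859-1.
    string.encode(Encoding::UTF_8)
  rescue EncodingError
    # Fallback: assume the bytes were UTF-8 all along and relabel them,
    # replacing anything that still is not valid UTF-8.
    string.force_encoding(Encoding::UTF_8).scrub
  end
end

coerce_to_utf8("\u221A".b)                            # => "√"
coerce_to_utf8("\xE5".force_encoding("ISO-8859-1"))   # => "å"
coerce_to_utf8(123)                                   # => "123"
```

Trying encode first keeps strings in other real encodings intact, while the fallback covers the mislabeled ASCII-8BIT case from this issue.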

JoshCheek commented 9 years ago

Bug report opened here

J-Swift commented 9 years ago

Thanks for posting this detailed report Josh. You saved me a bunch of time trying to figure this out myself for a project I'm working on.

JoshCheek commented 9 years ago

lol, np. When it takes me a lot of effort to figure something out, I try to document it in that moment, so I don't have to re-experience it later ^_^ Glad I was able to save someone else this confusion!

stephan-nordnes-eriksen commented 9 years ago

I just came across this issue myself. Is there any solution that will allow me to parse UTF-8 strings using the ruby json parser?

I think the solution is here: https://github.com/ohler55/oj

JoshCheek commented 9 years ago

It does parse UTF-8 strings. The problem here was that the string wasn't valid UTF-8, so it tried to fix the encoding and couldn't.

stephan-nordnes-eriksen commented 9 years ago

I could not get it to successfully parse JSON.parse(JSON.dump("√")). Maybe I don't fully understand the issue, but I ended up converting to the Oj gem.

It does seem to handle anything I have been able to throw at it thus far, but it behaves a bit differently, so you need to set Oj.default_options = {:mode => :compat } to make it work similarly to the normal JSON. A plus is that it is apparently faster as well.

JoshCheek commented 9 years ago

That's a different issue: it's blowing up because "√" is not valid as a toplevel JSON value.

JSON only considers arrays and objects (ruby hashes) to be valid toplevel objects. The spec does a bad job of conveying this, but at the top of http://json.org/ it says:

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.

An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

You can see it in Ruby's parsing code here.

The default JSON parser can deal with this, it just doesn't by default, because that's apparently nonstandard. To get it to parse this, pass the key quirks_mode when you parse it.

$ ruby -rjson -e 'p JSON.parse(JSON.dump("√"), quirks_mode: true)'
"√"

stephan-nordnes-eriksen commented 9 years ago

Interesting. Strange that basic values are invalid JSON. It does make sense though, given "JavaScript Object Notation". If I were designing this, though, I would have added those as valid syntax; "some string" feels so wrong to be invalid. The same goes for 123 and true. Why are arrays valid then? Shouldn't those be values inside an object, similar to strings? Weird.

Anyways, thanks a lot for clarifying!

JoshCheek commented 8 years ago

Just found this wonderful resource about encodings, adding it here b/c this is where I go when I get confused about them.

JoshCheek / seeing_is_believing

JSON.parse/encoding error #46
