brianmario / yajl-ruby

A streaming JSON parsing and encoding library for Ruby (C bindings to yajl)
http://rdoc.info/projects/brianmario/yajl-ruby
MIT License

invalid bytes in UTF8 string #64

Closed lucasallan closed 13 years ago

lucasallan commented 13 years ago

Yajl doesn't accept accented characters, any idea?

Yajl::ParseError (lexical error: invalid bytes in UTF8 string.
          724,"name":"Residencial Gaudí"}}
                     (right here) ------^
):

brianmario commented 13 years ago

can you paste the original input string in here after calling .inspect on it so all the binary data gets escaped?

lucasallan commented 13 years ago

The original string is "Residencial Gaudí",

ruby-1.9.2-p180 :001 > s = "Residencial Gaudí"
 => "Residencial Gaudí"
ruby-1.9.2-p180 :003 > s.inspect
 => "\"Residencial Gaudí\""

brianmario commented 13 years ago

I don't see an accent character in there? Are you sure that's the exact original string?

lucasallan commented 13 years ago

The string is "Residencial Gaudí",

the last letter has an accent " í "

But if I use any other letter with an accent like '^ ~ ' ' or letters like 'ç', the same problem happens.

brianmario commented 13 years ago

Could you try calling .bytes.to_a on the string and paste me the output?

lucasallan commented 13 years ago

ree-1.8.7-2011.03 :002 > "Residencial Gaudí".bytes.to_a
 => [82, 101, 115, 105, 100, 101, 110, 99, 105, 97, 108, 32, 71, 97, 117, 100, 195, 173]

and

ree-1.8.7-2011.03 :003 > "caça praça êé".bytes.to_a
 => [99, 97, 195, 167, 97, 32, 112, 114, 97, 195, 167, 97, 32, 195, 170, 195, 169]

That string comes in as JSON from an HTTP request. I have a Rails controller, and an Android app sends a POST with JSON; that is when the error happens.

brianmario commented 13 years ago

I'm able to parse the strings without error:

ree-1.8.7-2011.03 :010 > require 'yajl'
 => false 
ree-1.8.7-2011.03 :011 > str = "\"#{[82, 101, 115, 105, 100, 101, 110, 99, 105, 97, 108, 32, 71, 97, 117, 100, 195, 173].map{|c| c.chr}.join}\""
 => "\"Residencial Gaud\303\255\"" 
ree-1.8.7-2011.03 :012 > puts str
"Residencial Gaudí"
 => nil 
ree-1.8.7-2011.03 :013 > Yajl.load str
 => "Residencial Gaud\303\255" 
ree-1.8.7-2011.03 :014 > puts Yajl.load str
Residencial Gaudí
 => nil 
ree-1.8.7-2011.03 :015 > str2 = "\"#{[99, 97, 195, 167, 97, 32, 112, 114, 97, 195, 167, 97, 32, 195, 170, 195, 169].map{|c| c.chr}.join}\""
 => "\"ca\303\247a pra\303\247a \303\252\303\251\"" 
ree-1.8.7-2011.03 :016 > puts str2
"caça praça êé"
 => nil 
ree-1.8.7-2011.03 :017 > Yajl.load str2
 => "ca\303\247a pra\303\247a \303\252\303\251" 
ree-1.8.7-2011.03 :018 > puts Yajl.load str2
caça praça êé
 => nil

Can you paste the bytes for the original (entire) JSON string itself, not just the part where the error was?

lucasallan commented 13 years ago

I'm thinking this is a problem in Rails, because the problem only happens when the string is sent as a parameter in a request. It is very hard to debug, because the exception happens before entering the create method.

Started POST "/locations" for 10.0.0.2 at Wed Jun 08 13:12:04 -0300 2011
Error occurred while parsing request parameters. Contents:

Yajl::ParseError (lexical error: invalid bytes in UTF8 string.
          aa","latitude":-7,"name":"caça"}}
                     (right here) ------^
):

Stanley commented 13 years ago

I have a similar problem:

ruby-1.9.2-p180 :002 > Yajl::Parser.parse '{"żółty": "foo"}', :symbolize_keys => true
EncodingError: invalid encoding symbol
from /home/stan/.rvm/gems/ruby-1.9.2-p180/gems/yajl-ruby-0.8.2/lib/yajl.rb:37:in `parse'
from /home/stan/.rvm/gems/ruby-1.9.2-p180/gems/yajl-ruby-0.8.2/lib/yajl.rb:37:in `parse'
from (irb):4
from /home/stan/.rvm/rubies/ruby-1.9.2-p180/bin/irb:16:in `<main>'

ruby-1.9.2-p180 :003 > "żółty".bytes.to_a
 => [197, 188, 195, 179, 197, 130, 116, 121]

tc commented 13 years ago

I'm seeing a similar error on rails 3.1.0.rc5/yajl 0.8.2/ ruby 1.9.2p180.

I'm posting a JSON body with UTF8 characters.

JSON::ParserError (lexical error: invalid bytes in UTF8 string.
          of French writers such as St?phane Mallarm? and Joseph Joube
                     (right here) ------^
):
  translations of French writers such as St\xE9phane Mallarm\xE9 and Joseph Joubert.

larsgt commented 13 years ago

I've got a similar issue with yajl-ruby 0.8.3

rogerbraun commented 13 years ago

At least part of the problems should be solved with 0.8.3, see https://github.com/brianmario/yajl-ruby/pull/71

brianmario commented 13 years ago

@tc that string looks to be in the ISO-8859-1 encoding and JSON requires it to be in UTF-8. Can you transcode it into UTF-8 before handing it to yajl-ruby?

A quick way to check if a string is valid UTF-8 in 1.9.2 is to do this:

"some string".force_encoding('UTF-8').valid_encoding?

@larsgt - what is the string you're having trouble with?
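
For the transcoding step suggested above, here is a minimal sketch. It assumes the fallback encoding is known to be ISO-8859-1 (as it appeared to be for @tc's string); the helper name `to_utf8` is made up for illustration and is not part of yajl-ruby. If the real source encoding is something else, the result will be valid UTF-8 but the wrong characters.

```ruby
# Hypothetical helper (not part of yajl-ruby): return a UTF-8 copy of `raw`.
# Bytes that are already valid UTF-8 are kept as-is; otherwise they are
# reinterpreted as `fallback` (an assumption about the sender) and transcoded.
def to_utf8(raw, fallback = Encoding::ISO_8859_1)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# Bytes as they might arrive in a request body: \xE9 is "é" in ISO-8859-1.
latin1 = "St\xE9phane".dup.force_encoding("BINARY")
result = to_utf8(latin1)
already_utf8 = to_utf8("Gaud\u00ED")
```

The string handed to the parser would then be `result` rather than the raw body.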

larsgt commented 13 years ago

Python's json tools also barf on this string. I ended up cleaning up our database. 226 was the code of the bad character. Here is what we ran to fix the string: [66, 97, 114, 226, 109, 44].pack("U*")
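
A short sketch of why that fix works, assuming each number was meant as a character code (true for ISO-8859-1, where byte values equal Unicode code points): `pack("C*")` writes each number as a single raw byte, while `pack("U*")` writes each number as a UTF-8 encoded code point, so 226 becomes the two bytes 195, 162.

```ruby
codes = [66, 97, 114, 226, 109, 44]

# As raw bytes, 226 (0xE2) starts a multi-byte UTF-8 sequence that is never
# completed, so the string is not valid UTF-8.
raw = codes.pack("C*").force_encoding("UTF-8")
raw_valid = raw.valid_encoding?

# As code points, 226 is U+00E2 and gets encoded as two UTF-8 bytes.
fixed = codes.pack("U*")
fixed_bytes = fixed.bytes.to_a
```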

brianmario commented 13 years ago

closing this for now. basically the input must be valid utf-8 in order for yajl-ruby to be able to parse it correctly

igrigorik commented 12 years ago

Running into the same issue.. except my data source is github's timeline.. :-)

shas":[["652951d929f014eeaa6f3f01f5106d40ad97ea41","lukasz.milewski@gmail.com","Added JSON support","?ukasz Milewski",true]]

Results in:

Processing exception: lexical error: invalid bytes in UTF8 string.
          l.com","Added JSON support","?ukasz Milewski",true]],"ref":"
                     (right here) ------^

1) Suggestions for how to deal with this, short of dumping the entire input stream? 2) Looks like an encoding bug on github? /cc @tmm1

Commit event in above bug: https://github.com/IGED-UFPB/IGED/compare/96883dfb92...b5b6835788

In fact, fetching events from that repo shows plenty of the same problems: https://api.github.com/repos/IGED-UFPB/IGED/events, e.g.:

message: "Corre??o na compara??o de duas Lias pela camada de Abstra??o."

brianmario commented 12 years ago

Damn, I REALLY need to try to finish up 2.0. Unfortunately there isn't much I can do, since yajl 1.x doesn't do any Unicode validation at all (and that's what we're using). We use charlock_holmes to try to guess the encoding and transcode stuff into UTF-8 before encoding, but sometimes there isn't enough data to make an accurate detection. Anyway, this is definitely an issue we (GitHub) need to deal with. Would you mind hitting up support@github and mentioning me? I'll do what needs to happen ;)

igrigorik commented 12 years ago

Fired off an email to support - thanks Brian!

kitplummer commented 11 years ago

Just now running into the same issue. @igrigorik - I'm trying to use yajl-ruby to parse through your archive events too. :) What was the resolution?

igrigorik commented 11 years ago

@kitplummer that's odd.. I'm serializing it into those archives with Yajl - it should have bombed one step before on my end. If the archive is \n delimited (depends on the date range), you can do the "read one line, parse one line" trick.. and rescue the exception and skip.
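
A rough sketch of that "read one line, parse one line" trick, assuming the input really is newline-delimited. It uses Ruby's standard `json` library so it runs without extra gems; with yajl-ruby the equivalent would be `Yajl::Parser.parse` and rescuing `Yajl::ParseError`. The helper name `each_parsed_line` is made up for illustration, and the explicit `valid_encoding?` check is an addition here, since not every parser rejects invalid UTF-8 on its own.

```ruby
require "json"
require "stringio"

# Hypothetical helper: yield each successfully parsed line of `io` and
# return how many lines were skipped.
def each_parsed_line(io)
  skipped = 0
  io.each_line do |line|
    # Skip lines whose bytes are not valid UTF-8 before parsing.
    unless line.valid_encoding?
      skipped += 1
      next
    end
    next if line.strip.empty?

    begin
      record = JSON.parse(line)
    rescue JSON::ParserError
      skipped += 1
      next
    end
    yield record
  end
  skipped
end

# Usage: one good line, one with a stray ISO-8859-1 byte, one that is not
# JSON at all, a blank line, and another good line.
input = StringIO.new("{\"a\":1}\n{\"name\":\"ca\xE7a\"}\nnot json\n\n{\"b\":2}\n")
seen = []
skipped = each_parsed_line(input) { |record| seen << record }
```

Counting the skipped lines, rather than silently dropping them, makes it easier to tell how much of an archive was affected.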