Closed GoogleCodeExporter closed 8 years ago
I had the same problem. Unfortunately the \xxx octal sequences will be encoded
on my
client side (GWT / JS) as Latin-1 Characters, so completely wrong.
I used the unescapeBytes function on my server side to convert back to the
original
characters. Then I write to an UTF-8 Stream. Now it works also on the client
side.
Original comment by janos.ko...@gmail.com
on 11 May 2010 at 6:35
I'm having the same problem. \xxx octal sequences aren't valid JSON and my json
parser is (quite rightly)
vomiting on them.
Original comment by jose...@gmail.com
on 18 May 2010 at 3:45
At the moment byte arrays can look like this:
"\123\124\125"
... Thats an invalid JSON escape sequence. Your parser parses it, but no
standards compliant parser will.
The code should be changed here:
http://code.google.com/p/protobuf-java-format/source/browse/trunk/protobuf-java-
format/src/java/com/google/protobuf/JsonFormat.java#1186
Original comment by jose...@gmail.com
on 18 May 2010 at 4:08
Based on the standard definition from http://www.json.org, the required change
is to the JsonFormat.escapeBytes() method:
byte b = input.byteAt(i);
...
default:
if (b >= 0x20) {
builder.append((char) b);
} else {
builder.append('\\u');
int code = Character.digit(b, 16);
String hex = Integer.toHexString(code);
builder.append(hex);
}
break;
Original comment by eliran.bivas
on 25 May 2010 at 1:41
mistakenly wrote
builder.append('\\u');
instead of
builder.append("\u");
Original comment by eliran.bivas
on 25 May 2010 at 2:31
I'm still getting some weirdness with the string encoding.
According to the JSON RFC, string literals can contain unicode characters
(except for some ctrl chars) OR be escaped (\uXXXX).
So encoding the value "é" as a JSON string can be either:
value:"é"
OR
value:"\u00E9"
Currently (r34) JsonFormat will first encode the string to UTF8 (which in our
case results in 2 bytes) then, it will encode each byte. When the byte is
outside the ASCII plane, it is encoded using its Unicode sequence (\uXXXX). As
such, the result is double-encoding of the value (first to UTF-8 then to
Unicode-32) so we get garbage on the client:
value:"\u00c3\u00a9"
where:
\u00c3 is the UTF-8 control character
\u00a9 is the second byte in the sequence
See section 2.5 of http://www.ietf.org/rfc/rfc4627.txt
Original comment by philippe...@gmail.com
on 2 Aug 2010 at 2:31
I've made the necessary change to escapeBytes() method.
Please review the latest revision and tell me if it's working for you.
Original comment by eliran.bivas
on 3 Aug 2010 at 9:04
Encoding "é" now results in "\uffc3\uffa9". Reading this back with JsonFormat
results in "sY".
I think the problem is that the escapeText() method converts the text string
into UTF-8. This should not happen at all. The escapeText method should encode
a CharSequence directly, not a ByteString (since JSON supports unicode).
I'll try to submit a patch for this. Thanks for the quick turnaround.
Original comment by philippe...@gmail.com
on 3 Aug 2010 at 2:55
Allright, here's a patch that encodes String values into JSON String as
described in the RFC.
* all control characters are encoded using \uXXXX except for the ones that
allow the short version \X (newline, formfeed, etc.)
* single quote is also encoded as \' even if the JSON RFC doesn't specify that
it can
* any other character in the 0x0000-0xFFFF range is printed as-is
* any character outside of the 0x0000-0xFFFF range is printed by encoding it's
UTF-16 surrogate pair (\uXXXX\uXXXX).
The patch also adds a few test cases.
Original comment by philippe...@gmail.com
on 3 Aug 2010 at 6:54
Attachments:
The test for this is to check if JSON.parse (from http://www.JSON.org/json2.js)
accepts it. Even if this piece of code is more strict that the spec, this is
the piece of code that most people use to parse JSON.
Reading towards the end, you can see how it validates:
1. replace /\\(?:["\\\/bfnt]|u[0-9a-fA-F]{4})/ with @.
- i.e. only \" \\ \/ \b \f \n \t and \uXXXX are allowed
2. Now, a string must match /"[^"\\\n]*"/
So, ... octal escapes are not accepted by this validator. Neither are \a \r \v
and \' which are generated by escapeBytes().
BTW, both the encoding of strings and byte sequences have this problem,
Original comment by kresten....@gmail.com
on 18 Aug 2010 at 4:07
The patch I submitted will not encode \a, and \v (as did escapeBytes()), but
will encode \r (as per the RFC). It will not do any octal escaping either.
Quickly looking at the json2.js source, I think it does accept \r.
The only thing that is not compliant in the patch I submitted is the encoding
of \' which seems to be accepted by browsers.
Now that I think about it, I think it was a bad idea to add to the patch. It
can be easily left out when applying the patch.
kresten.krab: Did you test the patch I submitted?
Original comment by philippe...@gmail.com
on 18 Aug 2010 at 4:22
Hi, I haven't had the chance to look on it but after reading your comment
about, i think it might need a change.
sorry for the late replay.
Original comment by eliran.bivas
on 19 Aug 2010 at 6:13
In v.1.1.1 escaping of special chars (e.g. german umlauts like ä, ö, ü, ß)
is still wrong.
When I apply philippe's patch (see issue-11.patch above) my test work. So I
don't know why this has not found its way into the trunk during the last 6
months.
Here's another patch for the current JsonFormat.java in v.1.1.1 (rev 43) which
is essentially phillippe's patch without the single quote escaping (can be
omitted, read previous comment).
Maybe this will find its way into the next version.
Original comment by stephan....@gmail.com
on 4 Apr 2011 at 1:25
Attachments:
I've been using a patched version in production for a while now and have not
had a single issue.
It was important to leave out the single-quote special treatment as mentioned.
So you should be fine with your patched version.
As a side note, our patched version also fixes issue 17. It's available here
(as a maven project)
http://maven.obiba.org/maven2/com/google/protobuf/protobuf-java-format/
Original comment by philippe...@obiba.org
on 4 Apr 2011 at 4:23
Good to know. One more reason why your patch should be included in the next
release...
Original comment by stephan....@gmail.com
on 4 Apr 2011 at 7:06
any of you like to merge the patch?
Original comment by eliran.bivas
on 6 Apr 2011 at 7:01
When will a new version be released with these important fixes?
Original comment by t.britz%...@gtempaccount.com
on 21 Apr 2011 at 2:04
Eliran: I'm willing to merge the patch. I can also merge the fix for Issue 17.
Original comment by philippe...@obiba.org
on 21 Apr 2011 at 2:21
Original comment by eliran.bivas
on 21 Apr 2011 at 2:32
Issue 16 has been merged into this issue.
Original comment by eliran.bivas
on 21 Apr 2011 at 2:33
Issue 18 has been merged into this issue.
Original comment by eliran.bivas
on 21 Apr 2011 at 2:35
Issue 31 has been merged into this issue.
Original comment by philippe...@gmail.com
on 26 Apr 2011 at 5:19
Patch was merged into r47. Note that the patch does not encode the single
quote, as per the RFC (see comments above).
Original comment by philippe...@gmail.com
on 26 Apr 2011 at 8:19
Issue 32 has been merged into this issue.
Original comment by philippe...@gmail.com
on 3 May 2011 at 12:39
maybe someone might find this comment helpful, too:
http://code.google.com/p/protobuf-java-format/issues/detail?id=10#c25
tephan
Original comment by stephan....@gmail.com
on 5 May 2011 at 5:03
maybe someone might find this comment helpful, too:
http://code.google.com/p/protobuf-java-format/issues/detail?id=10#c25
tephan
Original comment by stephan....@gmail.com
on 5 May 2011 at 5:04
Original issue reported on code.google.com by
oskar...@gmail.com
on 3 Mar 2010 at 2:03