carlomedas / protobuf-java-format

Automatically exported from code.google.com/p/protobuf-java-format
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

Weird escaping of some characters #11

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. A string field in a message contains characters like à, ò, è, é (etc)
2. Look at the output of String jsonFormat =
jsonFormat.printToString(someProto);

What is the expected output? What do you see instead?
I want to see the original characters, not \xxx octal sequences.

What version of the product are you using? On what operating system?
protobuf-format-java-1.1.jar

Please provide any additional information below.
I tried to modify the method escapeBytes in JsonFormat.java but I can't see
my original characters.

Original issue reported on code.google.com by oskar...@gmail.com on 3 Mar 2010 at 2:03

GoogleCodeExporter commented 9 years ago
I had the same problem. Unfortunately the \xxx octal sequences will be encoded 
on my
client side (GWT / JS) as Latin-1 Characters, so completely wrong.
I used the unescapeBytes function on my server side to convert back to the 
original
characters. Then I write to an UTF-8 Stream. Now it works also on the client 
side. 

Original comment by janos.ko...@gmail.com on 11 May 2010 at 6:35

GoogleCodeExporter commented 9 years ago
I'm having the same problem. \xxx octal sequences aren't valid JSON and my json 
parser is (quite rightly) 
vomiting on them.

Original comment by jose...@gmail.com on 18 May 2010 at 3:45

GoogleCodeExporter commented 9 years ago
At the moment byte arrays can look like this:
"\123\124\125"
... Thats an invalid JSON escape sequence. Your parser parses it, but no 
standards compliant parser will.

The code should be changed here:
http://code.google.com/p/protobuf-java-format/source/browse/trunk/protobuf-java-
format/src/java/com/google/protobuf/JsonFormat.java#1186

Original comment by jose...@gmail.com on 18 May 2010 at 4:08

GoogleCodeExporter commented 9 years ago
Based on the standard definition from http://www.json.org, the required change 
is to the JsonFormat.escapeBytes() method:

        byte b = input.byteAt(i);
...
        default:
                if (b >= 0x20) {
                        builder.append((char) b);
                } else {
                        builder.append('\\u');
                        int code = Character.digit(b, 16);
                        String hex = Integer.toHexString(code);
                        builder.append(hex);
                }
                break;

Original comment by eliran.bivas on 25 May 2010 at 1:41

GoogleCodeExporter commented 9 years ago
mistakenly wrote 

builder.append('\\u'); 

instead of 

builder.append("\u");

Original comment by eliran.bivas on 25 May 2010 at 2:31

GoogleCodeExporter commented 9 years ago
I'm still getting some weirdness with the string encoding.

According to the JSON RFC, string literals can contain unicode characters 
(except for some ctrl chars) OR be escaped (\uXXXX).

So encoding the value "é" as a JSON string can be either:
value:"é"
OR
value:"\u00E9"

Currently (r34) JsonFormat will first encode the string to UTF8 (which in our 
case results in 2 bytes) then, it will encode each byte. When the byte is 
outside the ASCII plane, it is encoded using its Unicode sequence (\uXXXX). As 
such, the result is double-encoding of the value (first to UTF-8 then to 
Unicode-32) so we get garbage on the client:

value:"\u00c3\u00a9"
where:
\u00c3 is the UTF-8 control character
\u00a9 is the second byte in the sequence

See section 2.5 of http://www.ietf.org/rfc/rfc4627.txt

Original comment by philippe...@gmail.com on 2 Aug 2010 at 2:31

GoogleCodeExporter commented 9 years ago
I've made the necessary change to escapeBytes() method.

Please review the latest revision and tell me if it's working for you.

Original comment by eliran.bivas on 3 Aug 2010 at 9:04

GoogleCodeExporter commented 9 years ago
Encoding "é" now results in "\uffc3\uffa9". Reading this back with JsonFormat 
results in "sY".

I think the problem is that the escapeText() method converts the text string 
into UTF-8. This should not happen at all. The escapeText method should encode 
a CharSequence directly, not a ByteString (since JSON supports unicode).

I'll try to submit a patch for this. Thanks for the quick turnaround.

Original comment by philippe...@gmail.com on 3 Aug 2010 at 2:55

GoogleCodeExporter commented 9 years ago
Allright, here's a patch that encodes String values into JSON String as 
described in the RFC.

* all control characters are encoded using \uXXXX except for the ones that 
allow the short version \X (newline, formfeed, etc.)
* single quote is also encoded as \' even if the JSON RFC doesn't specify that 
it can
* any other character in the 0x0000-0xFFFF range is printed as-is
* any character outside of the 0x0000-0xFFFF range is printed by encoding it's 
UTF-16 surrogate pair (\uXXXX\uXXXX).

The patch also adds a few test cases.

Original comment by philippe...@gmail.com on 3 Aug 2010 at 6:54

Attachments:

GoogleCodeExporter commented 9 years ago
The test for this is to check if JSON.parse (from http://www.JSON.org/json2.js) 
accepts it.  Even if this piece of code is more strict that the spec, this is 
the piece of code that most people use to parse JSON.

Reading towards the end, you can see how it validates:

1. replace /\\(?:["\\\/bfnt]|u[0-9a-fA-F]{4})/ with @.    
       - i.e. only \" \\ \/ \b \f \n \t and \uXXXX are allowed

2. Now, a string must match /"[^"\\\n]*"/ 

So, ... octal escapes are not accepted by this validator.  Neither are \a \r \v 
and \' which are generated by escapeBytes().

BTW, both the encoding of strings and byte sequences have this problem,

Original comment by kresten....@gmail.com on 18 Aug 2010 at 4:07

GoogleCodeExporter commented 9 years ago
The patch I submitted will not encode \a, and \v (as did escapeBytes()), but 
will encode \r (as per the RFC). It will not do any octal escaping either.

Quickly looking at the json2.js source, I think it does accept \r.

The only thing that is not compliant in the patch I submitted is the encoding 
of \' which seems to be accepted by browsers.

Now that I think about it, I think it was a bad idea to add to the patch. It 
can be easily left out when applying the patch.

kresten.krab: Did you test the patch I submitted?

Original comment by philippe...@gmail.com on 18 Aug 2010 at 4:22

GoogleCodeExporter commented 9 years ago
Hi, I haven't had the chance to look on it but after reading your comment 
about, i think it might need a change.

sorry for the late replay.

Original comment by eliran.bivas on 19 Aug 2010 at 6:13

GoogleCodeExporter commented 9 years ago
In v.1.1.1 escaping of special chars (e.g. german umlauts like ä, ö, ü, ß) 
is still wrong.

When I apply philippe's patch (see issue-11.patch above) my test work. So I 
don't know why this has not found its way into the trunk during the last 6 
months.

Here's another patch for the current JsonFormat.java in v.1.1.1 (rev 43) which 
is essentially phillippe's patch without the single quote escaping (can be 
omitted, read previous comment).

Maybe this will find its way into the next version.

Original comment by stephan....@gmail.com on 4 Apr 2011 at 1:25

Attachments:

GoogleCodeExporter commented 9 years ago
I've been using a patched version in production for a while now and have not 
had a single issue.

It was important to leave out the single-quote special treatment as mentioned. 
So you should be fine with your patched version.

As a side note, our patched version also fixes issue 17. It's available here 
(as a maven project)

http://maven.obiba.org/maven2/com/google/protobuf/protobuf-java-format/

Original comment by philippe...@obiba.org on 4 Apr 2011 at 4:23

GoogleCodeExporter commented 9 years ago
Good to know. One more reason why your patch should be included in the next 
release...

Original comment by stephan....@gmail.com on 4 Apr 2011 at 7:06

GoogleCodeExporter commented 9 years ago
any of you like to merge the patch?

Original comment by eliran.bivas on 6 Apr 2011 at 7:01

GoogleCodeExporter commented 9 years ago
When will a new version be released with these important fixes?

Original comment by t.britz%...@gtempaccount.com on 21 Apr 2011 at 2:04

GoogleCodeExporter commented 9 years ago
Eliran: I'm willing to merge the patch. I can also merge the fix for Issue 17.

Original comment by philippe...@obiba.org on 21 Apr 2011 at 2:21

GoogleCodeExporter commented 9 years ago

Original comment by eliran.bivas on 21 Apr 2011 at 2:32

GoogleCodeExporter commented 9 years ago
Issue 16 has been merged into this issue.

Original comment by eliran.bivas on 21 Apr 2011 at 2:33

GoogleCodeExporter commented 9 years ago
Issue 18 has been merged into this issue.

Original comment by eliran.bivas on 21 Apr 2011 at 2:35

GoogleCodeExporter commented 9 years ago
Issue 31 has been merged into this issue.

Original comment by philippe...@gmail.com on 26 Apr 2011 at 5:19

GoogleCodeExporter commented 9 years ago
Patch was merged into r47. Note that the patch does not encode the single 
quote, as per the RFC (see comments above).

Original comment by philippe...@gmail.com on 26 Apr 2011 at 8:19

GoogleCodeExporter commented 9 years ago
Issue 32 has been merged into this issue.

Original comment by philippe...@gmail.com on 3 May 2011 at 12:39

GoogleCodeExporter commented 9 years ago
maybe someone might find this comment helpful, too: 
http://code.google.com/p/protobuf-java-format/issues/detail?id=10#c25

tephan

Original comment by stephan....@gmail.com on 5 May 2011 at 5:03

GoogleCodeExporter commented 9 years ago
maybe someone might find this comment helpful, too: 
http://code.google.com/p/protobuf-java-format/issues/detail?id=10#c25

tephan

Original comment by stephan....@gmail.com on 5 May 2011 at 5:04