aantono / protobuf-java-format

Automatically exported from code.google.com/p/protobuf-java-format
BSD 3-Clause "New" or "Revised" License

Error in parsing JSON document containing Chinese characters. #32

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
JSON document:

{
                 "username":"検索jan5検索.8@test.relay.symantec.com"
                 ,"password":"12341234"
                 ,"display_name":"1231212"
                 ,"country":"US"
}

parses username field as: [“jan5”.8@test.relay.symantec.com]

Original issue reported on code.google.com by aant...@gmail.com on 27 Apr 2011 at 10:19

GoogleCodeExporter commented 9 years ago
What version did you use (or SVN revision)?

Original comment by philippe...@gmail.com on 2 May 2011 at 2:11

GoogleCodeExporter commented 9 years ago
The version used is 1.1.1

This is the email I got from the person using it:

------------------

Hi Alex,

I used your code to convert my inputStream to a String before calling 
JsonFormat.merge, but I still have the same problem with the object after 
returning.  I verified that the string argument has the Chinese characters that 
were in the input stream.
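
The stream-to-string step mentioned above is not shown in this thread. A minimal sketch of one way to do it, assuming the input is UTF-8 encoded (the class and method names here are hypothetical and are not part of protobuf-java-format):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8StreamReader {
    /**
     * Reads the whole stream into a String, decoding explicitly as UTF-8
     * instead of relying on the platform default charset.
     */
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The two non-ASCII characters from the reported username, followed by "jan5".
        String json = "{\"username\":\"\u691c\u7d22jan5\"}";
        InputStream in = new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
        System.out.println(readUtf8(in).equals(json)); // prints "true"
    }
}
```

If the string survives this step intact, as the email reports, the loss must happen later, inside the parser.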

I stepped through some of the JsonFormat code; it processes the token in 
consumeByteString:

        /**
         * If the next token is a string, consume it and return its (unescaped) value. Otherwise,
         * throw a {@link ParseException}.
         */
        public String consumeString() throws ParseException {
            return consumeByteString().toStringUtf8();
        }

        /**
         * If the next token is a string, consume it, unescape it as a
         * {@link com.google.protobuf.ByteString}, and return it. Otherwise, throw a
         * {@link ParseException}.
         */
        public ByteString consumeByteString() throws ParseException {
            char quote = currentToken.length() > 0 ? currentToken.charAt(0) : '\0';
            if ((quote != '\"') && (quote != '\'')) {
                throw parseException("Expected string.");
            }

            if ((currentToken.length() < 2)
                || (currentToken.charAt(currentToken.length() - 1) != quote)) {
                throw parseException("String missing ending quote.");
            }

            try {
                String escaped = currentToken.substring(1, currentToken.length() - 1);
                ByteString result = unescapeBytes(escaped);

The function unescapeBytes treats the token as a byte string, so the 
characters get lost because they don't fit in single bytes.  Do you know why 
it treats the token as a byte string?  I think this is the essence of the 
problem.

Original comment by aant...@gmail.com on 3 May 2011 at 4:37
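
The failure mode described in the email can be sketched in isolation. The example below assumes that a byte-oriented unescape narrows each Java char to one byte; that is an assumption about the 1.1.1 behaviour, not something confirmed in this thread, and `narrowThenDecode` is a hypothetical stand-in rather than the library's method. Under that assumption, the two non-ASCII characters in the username (U+691C and U+7D22) collapse to the bytes 0x1C (a control character) and 0x22 (a double quote), which is consistent with the stray quotes in the reported output.

```java
import java.nio.charset.StandardCharsets;

public class ByteTruncationDemo {
    /**
     * Hypothetical stand-in for a byte-oriented unescape: keeps only the
     * low 8 bits of each char, then decodes the bytes as UTF-8.
     */
    static String narrowThenDecode(String s) {
        byte[] bytes = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            bytes[i] = (byte) s.charAt(i); // lossy narrowing for chars above 0xFF
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Lossless round trip: encode the whole string as UTF-8, then decode it. */
    static String encodeThenDecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String username = "\u691c\u7d22jan5";
        System.out.println(narrowThenDecode(username).equals(username)); // prints "false"
        System.out.println(encodeThenDecode(username).equals(username)); // prints "true"
    }
}
```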

GoogleCodeExporter commented 9 years ago
This was fixed by the patch for issue 11. The method "unescapeBytes" is no 
longer used for parsing strings.

Either use trunk or wait for the next release to get the fix.

Original comment by philippe...@gmail.com on 3 May 2011 at 12:39
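
The patch for issue 11 is not reproduced in this thread. As an illustration only, unescaping a string token without going through bytes might look like the sketch below; `unescapeText` and the set of escapes it handles are assumptions, not the code that was committed.

```java
public class CharUnescapeSketch {
    /**
     * Unescapes a JSON-style string body char by char. Characters that are
     * not part of an escape sequence, including non-ASCII ones, are copied
     * through unchanged, so nothing is narrowed to a byte.
     */
    static String unescapeText(String escaped) {
        StringBuilder out = new StringBuilder(escaped.length());
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (++i >= escaped.length()) {
                throw new IllegalArgumentException("Trailing backslash");
            }
            char e = escaped.charAt(i);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case '"': case '\'': case '\\': case '/': out.append(e); break;
                case 'u':
                    // A unicode escape needs exactly four hex digits after the 'u'.
                    if (i + 4 >= escaped.length()) {
                        throw new IllegalArgumentException("Truncated unicode escape");
                    }
                    out.append((char) Integer.parseInt(escaped.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown escape: \\" + e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "\u691c\u7d22jan5";
        System.out.println(unescapeText(raw).equals(raw)); // prints "true"
    }
}
```

The design point is that the result is built in a `StringBuilder` of chars, so a multi-byte character never has to fit into a single byte.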

GoogleCodeExporter commented 9 years ago
Philippe / Alex, please add a unit test to verify this issue on trunk.

I'm re-opening this issue until it is verified.

Original comment by eliran.bivas on 3 May 2011 at 12:49

GoogleCodeExporter commented 9 years ago
Alex can confirm, but he added a unit test in r61. I'll let Alex close the 
issue when he confirms this.

Original comment by philippe...@gmail.com on 3 May 2011 at 12:54

GoogleCodeExporter commented 9 years ago
Code reviewed. Alex, close this if you think the issue is fixed.

Original comment by eliran.bivas on 3 May 2011 at 1:23

GoogleCodeExporter commented 9 years ago
Looks like the latest trunk has fixed the issue.  Unit test is in place to 
verify that future changes won't break it.

Original comment by aant...@gmail.com on 4 May 2011 at 4:18