certik / yaml-cpp

Automatically exported from code.google.com/p/yaml-cpp
MIT License

UTF16 multi-byte character parsing is incorrect #240

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
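The Unicode character U+103A0 is the surrogate pair D800 DFA0 in UTF16. Written in big-endian byte order (the UTF16LE bytes are \x00\xD8\xA0\xDF), that is

\xD8\x00\xDF\xA0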

yaml-cpp parses this as

\xEF\xBF\xBD\xF0\x90\x8E\xA0

Instead, it should parse it as

\xF0\x90\x8E\xA0

It appears to be dumping a replacement character (U+FFFD, \xEF\xBF\xBD in UTF-8) into the stream first, before the correctly decoded character.
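
For reference, here is a minimal sketch of how a UTF16 surrogate pair is expected to be combined into a single code point before any UTF-8 output is produced. This is not yaml-cpp's actual stream code; `AppendUtf8` and `Utf16ToUtf8` are hypothetical helper names made up for illustration, and the `bigEndian` flag is an assumption standing in for whatever BOM or heuristic detection the real parser does. The point it illustrates is that a high surrogate by itself should emit nothing, and U+FFFD should only appear when the pair is actually broken.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not yaml-cpp API): append one code point as UTF-8.
void AppendUtf8(std::uint32_t cp, std::string& out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper (not yaml-cpp API): transcode raw UTF-16 bytes to UTF-8.
std::string Utf16ToUtf8(const std::string& in, bool bigEndian) {
  const std::uint32_t kReplacement = 0xFFFD;
  std::string out;

  // Read the 16-bit code unit starting at byte offset i.
  auto unitAt = [&](std::size_t i) -> std::uint32_t {
    std::uint32_t a = static_cast<unsigned char>(in[i]);
    std::uint32_t b = static_cast<unsigned char>(in[i + 1]);
    return bigEndian ? (a << 8) | b : (b << 8) | a;
  };

  std::size_t i = 0;
  while (i + 1 < in.size()) {
    std::uint32_t unit = unitAt(i);
    i += 2;
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: emit nothing yet, look at the next unit first.
      if (i + 1 < in.size()) {
        std::uint32_t low = unitAt(i);
        if (low >= 0xDC00 && low <= 0xDFFF) {
          i += 2;
          AppendUtf8(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00), out);
          continue;
        }
      }
      AppendUtf8(kReplacement, out);  // unpaired high surrogate
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      AppendUtf8(kReplacement, out);  // stray low surrogate
    } else {
      AppendUtf8(unit, out);  // ordinary BMP code unit
    }
  }
  if (i < in.size()) AppendUtf8(kReplacement, out);  // trailing odd byte
  return out;
}
```

With this sketch, the surrogate pair D800 DFA0 (passed in either byte order with the matching `bigEndian` flag) comes out as exactly \xF0\x90\x8E\xA0, with no leading \xEF\xBF\xBD.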

Note that EncodingTest.UTF16LE_noBOM and the other UTF16 tests are disabled, 
since they all fail.

Original issue reported on code.google.com by jbe...@gmail.com on 24 Mar 2014 at 1:23

GoogleCodeExporter commented 9 years ago
Fixed, r2ba3ab63449f.

Richard, I just cc'd you to let you know about this, since you might be 
interested :)

I found it when I was porting the tests to gtest, and it turned out that there 
was a typo in the tests that caused them to pass, even when there was a 
mismatch in the output.

It looks like there was just a typo in the multi-byte parsing code also, which 
was pretty straightforward to fix. I guess not many people use UTF16 multi-byte 
characters :)

Original comment by jbe...@gmail.com on 24 Mar 2014 at 1:28

GoogleCodeExporter commented 9 years ago
The commit itself is r42a3de8d463ebcdf2175c22e5e2d25ac08e3eb43. Let's see if 
google code can figure out the link...

Original comment by jbe...@gmail.com on 24 Mar 2014 at 1:29