SnakeYAML can't load higher-plane unicode characters

GoogleCodeExporter commented 9 years ago

In JRuby, we had an issue reported where YAML could be dumped containing 
higher-plan unicode characters like HAMBURGER, U+1F354), but could not load 
them.

Example from JRuby:

system ~/projects/jruby $ jruby -ryaml -e "p YAML.dump(\"\u{1f354}\")"
"--- ! \"🍔\"\n"

system ~/projects/jruby $ jruby -ryaml -e "p 
YAML.load(YAML.dump(\"\u{1f354}\"))"
Psych::SyntaxError: (<unknown>): <reader> unacceptable character '?' (0xD83C) 
special characters are not allowed
in "<reader>", position 7 at line 0 column 0
         parse at org/jruby/ext/psych/PsychParser.java:197
  parse_stream at /Users/headius/projects/jruby/lib/ruby/1.9/psych.rb:203
         parse at /Users/headius/projects/jruby/lib/ruby/1.9/psych.rb:151
          load at /Users/headius/projects/jruby/lib/ruby/1.9/psych.rb:127
        (root) at -e:1

(the blank in the first command's output is a HAMBURGER character)

It seems like perhaps SnakeYAML is not considering the possibility of surrogate 
characters in the String it parses, and only reads the first half of the 
HAMBURGER character. It should be checking characters for surrogacy and using 
Unicode codepoint logic if necessary.

I will see if I can find where in code this needs to happen.

Original issue reported on code.google.com by head...@headius.com on 26 Sep 2012 at 7:02

GoogleCodeExporter commented 9 years ago

Ok, I think SnakeYAML is handling the input properly, but not outputting 
properly.

According to the YAML specs for 1.1 and 1.2:

"On output, a YAML processor must only produce these acceptable characters, and 
should also escape all non-printable Unicode characters."

In the above case, it is clearly not emitting the escaped version of the 
HAMBURGER character, and is instead emitting the actual utf-8 sequence into the 
output. It then blows up attempting to load those characters, since 
StreamReader properly errors out on non-printable characters in the YAML stream.

Original comment by head...@headius.com on 26 Sep 2012 at 7:14

GoogleCodeExporter commented 9 years ago

This issue may be a duplicate of 
http://code.google.com/p/snakeyaml/issues/detail?id=155 which appears to have 
had a fix in the past month.

Is a 1.11 release pending any time soon? I will test 1.11 build in JRuby now.

Original comment by head...@headius.com on 26 Sep 2012 at 7:21

GoogleCodeExporter commented 9 years ago

I have tested JRuby with a HEAD build of SnakeYAML and the issue persists. It 
does not appear the regular expression for catching unprintable characters is 
catching this case correctly.

Original comment by head...@headius.com on 26 Sep 2012 at 7:42

GoogleCodeExporter commented 9 years ago

It appears SnakeYAML is missing the unprintable mask for higher-plane 
characters... specifically the range 0x10000-0x10FFFF shown in the YAML spec 
for 32-bit character ranges.

This seems like the right fix, but it doesn't appear to match correctly:

diff -r 9126dde66e63 src/main/java/org/yaml/snakeyaml/reader/StreamReader.java
--- a/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java Wed Sep 05 
20:14:52 2012 +0200
+++ b/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java Wed Sep 26 
15:03:43 2012 -0500
@@ -29,8 +29,13 @@
  * Reader: checks if characters are in allowed range, adds '\0' to the end.
  */
 public class StreamReader {
-    public final static Pattern NON_PRINTABLE = Pattern
-            .compile("[^\t\n\r\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD]");
+    public final static Pattern NON_PRINTABLE;
+    static {
+        StringBuilder builder = new 
StringBuilder("[^\t\n\r\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD")
+                .appendCodePoint(0x10000).append('-').appendCodePoint(0x10FFFF)
+                .append(']');
+        NON_PRINTABLE = Pattern.compile(builder.toString());
+    }
     private String name;
     private final Reader stream;
     private int pointer = 0;

Original comment by head...@headius.com on 26 Sep 2012 at 8:05

GoogleCodeExporter commented 9 years ago

Note also that SnakeYAML is encoding 32-bit (i.e. higher-plane) characters 
using two \u sequences rather than a single \U sequence, as stated in the YAML 
spec:

Escaped 16-bit Unicode character:
[63]    ns-esc-16-bit   ::= “\” “u” ( ns-hex-digit x 4 )     
Escaped 32-bit Unicode character:
[64]    ns-esc-32-bit   ::= “\” “U” ( ns-hex-digit x 8 )

Original comment by head...@headius.com on 26 Sep 2012 at 8:09

GoogleCodeExporter commented 9 years ago

Also, SnakeYAML appears to want to encode the entire string when any character 
matches, when it should just encode the unprintable characters.

Original comment by head...@headius.com on 26 Sep 2012 at 8:17

GoogleCodeExporter commented 9 years ago

Ok, updates...

* SnakeYAML does encode only the characters that need encoding, via a check in 
Emitter.writeDoubleQuoted.
* It encodes UTF-16 surrogate pairs as separate 16-bit escape sequences, which 
is wrong.
* It only encodes when !allowUnicode.

So the following changes are needed:

* Always encode unprintable characters (YAML spec).
* Encode 32-bit characters (UTF-16 surrogate pairs) as \UXXXXXXXX, not as a 
pair of \uXXXX.

This patch does both: https://gist.github.com/3796362

It appears to work for JRuby purposes, other than some minor behavior changes I 
need to reconcile with Ruby's YAML tests.

Original comment by head...@headius.com on 27 Sep 2012 at 8:47

GoogleCodeExporter commented 9 years ago

Updated the gist to include a test that exercises the new code path (or at 
least, it exercises the case of *valid* Unicode UTF-16 surrogate 
sequences...invalid sequences will basically come out the same as they were 
before).

https://gist.github.com/3796362

Original comment by head...@headius.com on 29 Sep 2012 at 12:19

GoogleCodeExporter commented 9 years ago

The patch has been accepted. Unfortunately the test does not cover the new code.
Please run 'mvn site' and check the coverage report for Emitter.java 
(target/site/cobertura/index.html). The 'if' statement in line 1238 is never 
'false' in tests.

Original comment by py4fun@gmail.com on 29 Sep 2012 at 12:58

Changed state: Started

GoogleCodeExporter commented 9 years ago

I have updated the gist with an additional test that covers the other code 
branch I added.

Original comment by head...@headius.com on 29 Sep 2012 at 3:20

GoogleCodeExporter commented 9 years ago

It will be delivered in version 1.11

Original comment by py4fun@gmail.com on 29 Sep 2012 at 6:50

Changed state: Fixed

cuizhennan / snakeyaml

SnakeYAML can't load higher-plane unicode characters #158