Closed GoogleCodeExporter closed 9 years ago
Ok, I think SnakeYAML is handling the input properly, but not outputting
properly.
According to the YAML specs for 1.1 and 1.2:
"On output, a YAML processor must only produce these acceptable characters, and
should also escape all non-printable Unicode characters."
In the above case, it is clearly not emitting the escaped version of the
HAMBURGER character, and is instead emitting the actual utf-8 sequence into the
output. It then blows up attempting to load those characters, since
StreamReader properly errors out on non-printable characters in the YAML stream.
Original comment by head...@headius.com
on 26 Sep 2012 at 7:14
This issue may be a duplicate of
http://code.google.com/p/snakeyaml/issues/detail?id=155 which appears to have
had a fix in the past month.
Is a 1.11 release pending any time soon? I will test 1.11 build in JRuby now.
Original comment by head...@headius.com
on 26 Sep 2012 at 7:21
I have tested JRuby with a HEAD build of SnakeYAML and the issue persists. It
does not appear the regular expression for catching unprintable characters is
catching this case correctly.
Original comment by head...@headius.com
on 26 Sep 2012 at 7:42
It appears SnakeYAML is missing the unprintable mask for higher-plane
characters... specifically the range 0x10000-0x10FFFF shown in the YAML spec
for 32-bit character ranges.
This seems like the right fix, but it doesn't appear to match correctly:
diff -r 9126dde66e63 src/main/java/org/yaml/snakeyaml/reader/StreamReader.java
--- a/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java Wed Sep 05
20:14:52 2012 +0200
+++ b/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java Wed Sep 26
15:03:43 2012 -0500
@@ -29,8 +29,13 @@
* Reader: checks if characters are in allowed range, adds '\0' to the end.
*/
public class StreamReader {
- public final static Pattern NON_PRINTABLE = Pattern
- .compile("[^\t\n\r\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD]");
+ public final static Pattern NON_PRINTABLE;
+ static {
+ StringBuilder builder = new
StringBuilder("[^\t\n\r\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD")
+ .appendCodePoint(0x10000).append('-').appendCodePoint(0x10FFFF)
+ .append(']');
+ NON_PRINTABLE = Pattern.compile(builder.toString());
+ }
private String name;
private final Reader stream;
private int pointer = 0;
Original comment by head...@headius.com
on 26 Sep 2012 at 8:05
Note also that SnakeYAML is encoding 32-bit (i.e. higher-plane) characters
using two \u sequences rather than a single \U sequence, as stated in the YAML
spec:
Escaped 16-bit Unicode character:
[63] ns-esc-16-bit ::= ā\ā āuā ( ns-hex-digit x 4 )
Escaped 32-bit Unicode character:
[64] ns-esc-32-bit ::= ā\ā āUā ( ns-hex-digit x 8 )
Original comment by head...@headius.com
on 26 Sep 2012 at 8:09
Also, SnakeYAML appears to want to encode the entire string when any character
matches, when it should just encode the unprintable characters.
Original comment by head...@headius.com
on 26 Sep 2012 at 8:17
Ok, updates...
* SnakeYAML does encode only the characters that need encoding, via a check in
Emitter.writeDoubleQuoted.
* It encodes UTF-16 surrogate pairs as separate 16-bit escape sequences, which
is wrong.
* It only encodes when !allowUnicode.
So the following changes are needed:
* Always encode unprintable characters (YAML spec).
* Encode 32-bit characters (UTF-16 surrogate pairs) as \UXXXXXXXX, not as a
pair of \uXXXX.
This patch does both: https://gist.github.com/3796362
It appears to work for JRuby purposes, other than some minor behavior changes I
need to reconcile with Ruby's YAML tests.
Original comment by head...@headius.com
on 27 Sep 2012 at 8:47
Updated the gist to include a test that exercises the new code path (or at
least, it exercises the case of *valid* Unicode UTF-16 surrogate
sequences...invalid sequences will basically come out the same as they were
before).
https://gist.github.com/3796362
Original comment by head...@headius.com
on 29 Sep 2012 at 12:19
The patch has been accepted. Unfortunately the test does not cover the new code.
Please run 'mvn site' and check the coverage report for Emitter.java
(target/site/cobertura/index.html). The 'if' statement in line 1238 is never
'false' in tests.
Original comment by py4fun@gmail.com
on 29 Sep 2012 at 12:58
I have updated the gist with an additional test that covers the other code
branch I added.
Original comment by head...@headius.com
on 29 Sep 2012 at 3:20
It will be delivered in version 1.11
Original comment by py4fun@gmail.com
on 29 Sep 2012 at 6:50
Original issue reported on code.google.com by
head...@headius.com
on 26 Sep 2012 at 7:02