Unicode characters outside of the basic multilingual plane get truncated when loaded

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Load a YAML document containing an escaped character outside of the basic 
multilingual plane.
2. Examine the resulting Java String 

What is the expected output? What do you see instead?

The expected result is that the Java String contains the input character in 
Java's native UTF-16 encoding, i.e. as a surrogate pair. The actual result is 
that the String contains only the lower 16 bits of the code point (U+F648 
instead of U+1F648 in the example below).

The YAML 1.0 and above specifications explicitly allow high codepoints (e.g. 
the 1.1 specification: <http://yaml.org/spec/1.1/#id868524>).
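
To make the expected and actual values concrete, here is a minimal standalone sketch in plain Java with no SnakeYAML involved. The class and method names are made up for illustration, and the narrowing cast is only an assumption about how such a truncation can arise, not a statement about SnakeYAML's actual code:

```java
// Illustrative only: hypothetical names, not taken from SnakeYAML.
class SupplementaryCodePointDemo {

    // Converts a code point to UTF-16; anything above U+FFFF becomes
    // a surrogate pair (two Java chars).
    static String toUtf16(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Assumed failure mode: a narrowing cast to char silently keeps
    // only the low 16 bits of the code point.
    static String truncated(int codePoint) {
        return String.valueOf((char) codePoint);
    }

    public static void main(String[] args) {
        int seeNoEvil = 0x1F648;
        // Two chars: high surrogate D83D, low surrogate DE48.
        System.out.println(toUtf16(seeNoEvil).length());
        // One char: F648, which matches the value reported in this issue.
        System.out.println(Integer.toHexString(truncated(seeNoEvil).charAt(0)));
    }
}
```

The second value is the same "\uf648" that shows up in the dumped output further down.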

What version of SnakeYAML are you using? On what Java version?

snakeyaml 1.9 on Java 1.6

Please provide any additional information below. (Often a failing test is
the best way to describe the problem.)

Java code snippet:

Yaml y = new Yaml();
// Compare with equals(); == would compare object references, not contents.
assert "\ud83d\ude48".equals(y.load("\"\\U0001f648\""));

The following transcript of an interactive Scala session illustrates the 
problem:

Welcome to Scala version 2.8.1.final (Java HotSpot(TM) 64-Bit Server VM, Java 
1.6.0_29).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.yaml.snakeyaml    
import org.yaml.snakeyaml

scala> val dumperOpts = new snakeyaml.DumperOptions()
dumperOpts: org.yaml.snakeyaml.DumperOptions = 
org.yaml.snakeyaml.DumperOptions@5f6f2b35

scala> dumperOpts.setAllowUnicode(false)

scala> val yaml = new snakeyaml.Yaml(dumperOpts)
yaml: org.yaml.snakeyaml.Yaml = Yaml:1585662705

scala> val seeNoEvil = "\"\\U0001f648\""
seeNoEvil: java.lang.String = "\U0001f648"

scala> val loaded = yaml.load(seeNoEvil)
loaded: java.lang.Object = 

scala> // Expected "\ud83d\ude48"

scala> val dumped = yaml.dump(loaded)
dumped: java.lang.String =
"\uf648"

scala> // Expected "\U0001f648"
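
The two expectations in the transcript above (an 8-digit escape decodes to a surrogate pair on load, and a supplementary character is written back as an 8-digit escape on dump) can be sketched in plain Java as follows. This is a hypothetical helper written for this report, not SnakeYAML's scanner or emitter, and it assumes a single well-formed escape as input:

```java
// Hypothetical sketch of the expected behaviour; not SnakeYAML code.
class EscapeRoundTripSketch {

    // Decodes one escape of the form backslash, capital U, 8 hex digits
    // into its UTF-16 representation.
    static String decodeLongEscape(String escape) {
        if (escape.length() != 10 || !escape.startsWith("\\U")) {
            throw new IllegalArgumentException("expected 8-digit escape: " + escape);
        }
        int codePoint = Integer.parseInt(escape.substring(2), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + escape);
        }
        // toChars yields one char for the BMP, two chars (a surrogate pair) above it.
        return new String(Character.toChars(codePoint));
    }

    // Escapes everything above printable ASCII, iterating by code point
    // so that a surrogate pair is emitted as one 8-digit escape.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0xFFFF) {
                out.append(String.format("\\U%08x", cp));
            } else if (cp > 0x7E) {
                out.append(String.format("\\u%04x", cp));
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String loaded = decodeLongEscape("\\U0001f648");
        System.out.println(loaded.length());
        System.out.println(escapeNonAscii(loaded));
    }
}
```

Iterating with codePointAt and Character.charCount rather than charAt is the design choice that matters here: a char-by-char loop would see the two surrogates separately and could never produce the 8-digit form.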

Original issue reported on code.google.com by joshh...@gmail.com on 3 Jan 2012 at 11:43

GoogleCodeExporter commented 9 years ago

Why do you explicitly do "dumperOpts.setAllowUnicode(false)" ?

Original comment by py4fun@gmail.com on 4 Jan 2012 at 12:40

GoogleCodeExporter commented 9 years ago

I set that option just to make the resulting "\uf648" easier to read.

Original comment by joshh...@gmail.com on 4 Jan 2012 at 1:04

GoogleCodeExporter commented 9 years ago

I have added the test: 
http://code.google.com/p/snakeyaml/source/browse/src/test/java/org/yaml/snakeyaml/issues/issue137/SupplementaryCharactersTest.java

Original comment by py4fun@gmail.com on 6 Jan 2012 at 9:18

Changed state: Started

GoogleCodeExporter commented 9 years ago

Please check the latest source. It should be fixed now.

Original comment by py4fun@gmail.com on 6 Jan 2012 at 12:09

GoogleCodeExporter commented 9 years ago

Feel free to try the latest 1.10-SNAPSHOT: 
https://oss.sonatype.org/content/groups/public/org/yaml/snakeyaml/1.10-SNAPSHOT/

The fix will be delivered in version 1.10

Original comment by py4fun@gmail.com on 9 Jan 2012 at 9:11

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

The test has been changed to fix issue 155

Original comment by py4fun@gmail.com on 5 Sep 2012 at 8:10

UcasRichard / snakeyaml

Unicode characters outside of the basic multilingual plane get truncated when loaded #137