aaubry / YamlDotNet

YamlDotNet is a .NET library for YAML
MIT License

YamlStream.Load with JSON with emojis (even escaped) fails: "While scanning a quoted scalar, found invalid Unicode character escape code." #838

Closed cmeeren closed 10 months ago

cmeeren commented 10 months ago

I am serializing something to JSON with System.Text.Json, and then converting it to YAML using YamlStream. However, if the JSON contains an emoji, even if it's escaped, YamlStream.Load throws:

(Line: 1, Col: 1, Idx: 0) - (Line: 1, Col: 4, Idx: 3): While scanning a quoted scalar, found invalid Unicode character escape code.

Code to reproduce:

open System.IO
open System.Text.Json
open YamlDotNet.RepresentationModel

let input = "👍"

let json = JsonSerializer.Serialize input

assert (json = "\"\\uD83D\\uDC4D\"")

let yamlStream = YamlStream()
yamlStream.Load(new StringReader(json))

EdwardCooke commented 10 months ago

Those UTF-8 codes are actually invalid codes for UTF-8, according to Wikipedia anyway. I wonder if the .NET Core JSON library sees your character as 2 separate characters because it might be UTF-16. I'm not able to dig into this too much right now, but I'll try to take a closer look tonight. I wonder if you can specify the encoding in the JSON serializer, set it to UTF-16, and see what happens.

cmeeren commented 10 months ago

UTF-16? But JSON is always UTF-8 by definition, isn't it? In any case, I can find nothing in System.Text.Json about UTF-16. Let me know if I'm wrong.

cmeeren commented 10 months ago

Also, AFAIK many emojis are composed of 2 or even more unicode characters. This SO question and answers may be helpful.

Finally, I highly doubt that System.Text.Json (the official .NET JSON serialization API) is doing things wrong. That would require very concrete proof.

EdwardCooke commented 10 months ago

This will also help me in getting a fix in

https://github.com/dotnet/runtime/issues/42847 There are some other linked issues that should help clarify the nuances in Unicode.

Not sure when I'll get it done though.

cmeeren commented 10 months ago

Is there any workaround I can apply now for converting JSON to YAML? The emojis etc. don't have to be unescaped; I'm OK with anything that preserves the escape codes.

I simply want my JSON content converted to YAML (tweaked using a YAML visitor). Now it seems that the YAML conversion is attempting to decode escaped stuff, and is failing at that.

cmeeren commented 10 months ago

Possibly relevant: The Wikipedia article on JSON, section "Character encoding", says:

JSON exchange in an open ecosystem must be encoded in UTF-8. The encoding supports the full Unicode character set, including those characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). However, if escaped, those characters must be written using UTF-16 surrogate pairs. For example, to include the Emoji character U+1F610 😐 NEUTRAL FACE in JSON:

{ "face": "😐" }
// or
{ "face": "\uD83D\uDE10" }

The latter is exactly what System.Text.Json does. And since YAML is a superset of JSON, I would expect any YAML implementation, such as YamlDotNet, to support such surrogate pairs.
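
To make the quoted rule concrete, here is a minimal sketch of how a code point outside the Basic Multilingual Plane maps to the two `\uXXXX` escapes. The function names `toSurrogatePair` and `jsonEscape` are hypothetical and exist only for illustration; they are not part of System.Text.Json or YamlDotNet.

```fsharp
// Illustrative only: helper names are hypothetical, not library APIs.

/// Splits a code point in U+10000..U+10FFFF into its UTF-16
/// high and low surrogate values.
let toSurrogatePair (codePoint: int) =
    if codePoint < 0x10000 || codePoint > 0x10FFFF then
        invalidArg "codePoint" "expected a code point above U+FFFF"
    let offset = codePoint - 0x10000           // 20-bit value
    let high = 0xD800 + (offset >>> 10)        // top 10 bits
    let low = 0xDC00 + (offset &&& 0x3FF)      // bottom 10 bits
    high, low

/// Formats the surrogate pair the way an escaped JSON string writes it.
let jsonEscape (codePoint: int) =
    let high, low = toSurrogatePair codePoint
    sprintf "\\u%04X\\u%04X" high low

printfn "%s" (jsonEscape 0x1F610)   // prints \uD83D\uDE10
printfn "%s" (jsonEscape 0x1F44D)   // prints \uD83D\uDC4D
```

The second call reproduces the escape sequence from the original reproduction (`\uD83D\uDC4D` for the thumbs-up emoji).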

cmeeren commented 10 months ago

Ideally I would like unescaped output (i.e., emojis in the YAML), but at least the following workaround lets me preserve the surrogate pair escape codes:

open System

// Unique placeholder that will not otherwise occur in the JSON or YAML text
let unicodeEscapeCodePlaceholder = Guid.NewGuid().ToString()

// Hide \u escapes from the YAML scanner by swapping in the placeholder
let escapeUnicodeEscapeCodes (str: string) =
    str.Replace(@"\u", unicodeEscapeCodePlaceholder)

// Restore the original \u escapes after the conversion
let unEscapeUnicodeEscapeCodes (str: string) =
    str.Replace(unicodeEscapeCodePlaceholder, @"\u")

let formatAsYaml json =
    let json = escapeUnicodeEscapeCodes json
    let yaml = (* load with YamlStream and transform to YAML *)
    unEscapeUnicodeEscapeCodes yaml

EdwardCooke commented 10 months ago

I definitely do agree that utf8 surrogate pairs should be supported. I'm hoping to have some time this weekend to look at it. I did find where it was throwing a fit though.

cmeeren commented 10 months ago

Happy to hear it!

ecooke-macu commented 10 months ago

I've been doing a lot of research on this, through the Unicode spec and all that. The surrogate pairs are a way of putting UTF-16 and UTF-32 characters in a UTF-8 file. JSON does this by escaping them; in raw UTF-8 files, it's done at the byte level (from my understanding). We just need to support the escaped version, just like the YAML spec shows (if I recall). It has a lot to do with bit masking and whatnot. It's pretty complicated, so it may take some time, but it may go quicker than I think. I'll let you know when I have a PR ready for it by linking it to this issue.
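
The bit masking mentioned above can be sketched as follows. This is only an illustration of the standard UTF-16 arithmetic, not the fix that went into YamlDotNet; `combineSurrogates` is a hypothetical helper name, while `Char.ConvertFromUtf32` is the existing .NET method for turning a code point into a string.

```fsharp
open System

/// Hypothetical helper: combines the numeric values of two escaped
/// surrogates (e.g. \uD83D and \uDC4D) into a single Unicode code point.
let combineSurrogates (high: int) (low: int) =
    if high < 0xD800 || high > 0xDBFF then
        invalidArg "high" "expected a high surrogate (D800..DBFF)"
    if low < 0xDC00 || low > 0xDFFF then
        invalidArg "low" "expected a low surrogate (DC00..DFFF)"
    // Each surrogate carries 10 bits of the 20-bit offset above U+10000.
    0x10000 + ((high - 0xD800) <<< 10) + (low - 0xDC00)

let codePoint = combineSurrogates 0xD83D 0xDC4D
printfn "U+%X" codePoint                      // prints U+1F44D

// Convert the code point back into a .NET string (two UTF-16 code units).
let text = Char.ConvertFromUtf32 codePoint
printfn "%d" text.Length                      // prints 2
```

A scanner that reads a `\uXXXX` escape in the high-surrogate range would, under this approach, expect a second `\uXXXX` escape in the low-surrogate range immediately after it and combine the two before emitting the character.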

cmeeren commented 10 months ago

Thanks! I have no idea what the internals of YamlDotNet are doing, but wouldn't it work to just keep the loaded text as-is and not attempt to decode the escape sequences, which it seems like it's doing now?

EdwardCooke commented 10 months ago

This fix is going to be released in the latest nuget package which should be available in about 10-15 minutes.

cmeeren commented 10 months ago

It works. Thanks a lot! 😊