aaubry / YamlDotNet

YamlDotNet is a .NET library for YAML
MIT License
2.48k stars 466 forks source link

Incorrect handling of carriage return `\r` and tab `\t` #867

Closed gregsdennis closed 7 months ago

gregsdennis commented 7 months ago

Describe the bug When parsing a string value that contains these escapes, the parser simply omits them.

To Reproduce Here's an NUnit test.

[Test]
public void Check()
{
    var yaml = Parse("\" \f\n\r\t\vabc \f\n\r\t\v\"");  // problem is here
    Console.WriteLine(Serialize(yaml));
}

public static YamlStream Parse(string yamlText)
{
    using var reader = new StringReader(yamlText);
    var yamlStream = new YamlStream();
    yamlStream.Load(reader);
    return yamlStream;
}

public static string Serialize(YamlStream yaml)
{
    var builder = new SerializerBuilder();
    var serializer = builder.Build();
    using var writer = new StringWriter();

    foreach (var document in yaml.Documents)
    {
        serializer.Serialize(writer, document.RootNode);
    }

    return writer.ToString();
}

Output is:

" \f\n\vabc \f\n\v"

You can even see in the debugger that the parsed value isn't right: image

EdwardCooke commented 7 months ago

I bet it’s how carriage return/line feeds are normalized that is causing the problem. And because the get normalized the tab character gets counted as white space and the subsequently dropped. Unfortunately I don’t have time at this moment to debug and get a fix in, the parser is pretty complicated. I’ll hunt down where it’s at and post a link though.

EdwardCooke commented 7 months ago

https://github.com/aaubry/YamlDotNet/blob/847230593e95750d4294ca72c98a4bd46bdcf265/YamlDotNet/Core/Scanner.cs#L191C16-L191C16

Basically it checks for any of those characters and counts it as a new line.

gregsdennis commented 7 months ago

I sse you actually have this test:

[Theory]
[InlineData("|\n  b-carriage-return\r  lll", "b-carriage-return\nlll")]
public void NewLinesAreParsedAccordingToTheSpecification(string yaml, string expected)
{
    AssertSequenceOfEventsFrom(Yaml.ParserForText(yaml),
        StreamStart,
        DocumentStart(Implicit),
        LiteralScalar(expected),
        DocumentEnd(Implicit),
        StreamEnd);
}

(other test cases removed)

It seems that maybe this is intended behavior? I'll check the spec.

I got this string from another spec that I'm trying to implement. It's possible they're just using bad YAML.

djmitche commented 7 months ago

@gregsdennis linked to https://yaml.org/spec/1.2.2/#54-line-break-characters in json-e/json-e#476. That section is about parsing, or perhaps more accurately tokenizing. It says that 0a0d, 0a, and 0d should be normalized to some single newline format when seen i the input -- even when within a scalar such as a multiline string value.

However, that section doesn't address parsing escapes in a string value (I'm sure that's covered elsewhere). And more to the point, it doesn't describe anything after the tokenization is complete. So if by whatever means a YAML input parses to a string containing ASCII characters CR, LF, or a consecutive CR and LF, this section does not at all apply to handling of that string value as the parsing is complete.

I suspect that the error in this bug report is in the input:

    var yaml = Parse("\" \f\n\r\t\vabc \f\n\r\t\v\"");  // problem is here

the C# parser is interpreting those escapes, so YAML is getting actual FF, CR, LF, TAB, etc. characters. I suspect that should be

    var yaml = Parse("\" \\f\\n\\r\\t\\vabc \\f\\n\\r\\t\\v\"");  // problem is here
gregsdennis commented 7 months ago

The actual text in question is found in a YAML file and is:

template: {$eval: "rstrip(' \f\n\r\t\vabc \f\n\r\t\v')"}

To my understanding, this is decoded as the actual whitespace chars.

https://yaml-online-parser.appspot.com/ converts this to JSON as:

{
  "template": {
    "$eval": "rstrip(' \f\n\r\t\u000babc \f\n\r\t\u000b')"
  }
}

which clearly preserves the characters.

Updating the test code above with (note the verbatim string @"" so escapes are right):

var json = JsonNode.Parse(@"{
  ""template"": {
    ""$eval"": ""rstrip(' \f\n\r\t\u000babc \f\n\r\t\u000b')""
  }
}");
Console.WriteLine(JsonSerializer.Serialize(json, new JsonSerializerOptions{Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping}));

the output for this is:

{"template":{"$eval":"rstrip(' \f\n\r\t\u000Babc \f\n\r\t\u000B')"}}
gregsdennis commented 7 months ago

I sorted this out by just not using the parser for reading these strings. Might still be an issue, but I'll close for now. If it comes up again, surely someone will report it again.