biojppm / rapidyaml

Rapid YAML - a library to parse and emit YAML, and do it fast.
MIT License
583 stars 100 forks source link

Parsing YAML with unquoted \0 characters #439

Closed stohrendorf closed 4 months ago

stohrendorf commented 5 months ago

This is the code which loads the YAML:

      std::ifstream file{filename, std::ios::in | std::ios::binary};
      gsl_Assert(file.is_open());
      file.seekg(0, std::ios::end);
      const auto size = static_cast<std::streamsize>(file.tellg());
      file.seekg(0, std::ios::beg);

      m_buffer.resize(size);
      file.read(m_buffer.data(), size);
      m_tree = ryml::parse_in_arena(c4::to_csubstr(m_buffer));

This is the YAML is should parse, including a newline at the end:

meta:
  title: "Croft Arena"
  author: "JaayStation, SOTOS"
  downloadSoundtrack: false
  urls:
    - "https://drive.google.com/file/d/1fJ1Z3Ktr2VlPK1MD5OB4tY7uDnqyRMEQ/view?usp=sharing"

However, in version 0.7.0, rapidyaml throws an error that it's missing a colon after the key at line 7, column 2, which is an empty line, and beyond the size of the buffer. It works fine with 0.6.0, so I'm wondering if there's a breaking change regarding the new parser I don't fully grasp, or if it's an actual bug regarding handling of null terminators. When I remove the std::ios::binary, it complains about column 7, which indicates the latter, as the file has 7 lines with \r\n terminators, and the debugger showed me that the buffer contained only plain \n's.

biojppm commented 5 months ago

I can't reproduce at all, and the YAML you provided is parsing correctly in my linux machine. Your report is confusing, and missing fundamental information:

Please provide a minimum verifiable example, with the exact YAML source embedded as a string in C++ code. Otherwise, I can't help you.

FWIW, this is what I tried:

void test_439(csubstr yaml)
{
    test_check_emit_check(yaml, [](Tree const &t){
        ConstNodeRef n = t["meta"];
        EXPECT_EQ(n["title"], "Croft Arena");
        EXPECT_EQ(n["author"], "JaayStation, SOTOS");
        EXPECT_EQ(n["downloadSoundtrack"], "false");
        EXPECT_EQ(n["urls"][0].val(), "https://drive.google.com/file/d/1fJ1Z3Ktr2VlPK1MD5OB4tY7uDnqyRMEQ/view?usp=sharing");
    });
}
void test_439_ostream(csubstr yaml)
{
    {
        std::ofstream ofile("adsasd.bin", std::ios::out | std::ios::binary);
        ofile.write(yaml.str, (std::streamsize)yaml.len);
    }
    std::vector<char> m_buffer;
    {
        std::ifstream file("adsasd.bin", std::ios::in | std::ios::binary);
        ASSERT_TRUE(file.is_open());
        file.seekg(0, std::ios::end);
        const std::streamsize size = static_cast<std::streamsize>(file.tellg());
        file.seekg(0, std::ios::beg);
        m_buffer.resize((size_t)size);
        file.read(&m_buffer[0], size);
    }
    EXPECT_EQ(to_csubstr(m_buffer), yaml);
    test_439(to_csubstr(m_buffer));
    // try with trailing \0
    m_buffer.resize(m_buffer.size() + 10);
    EXPECT_TRUE(to_csubstr(m_buffer).last(10).trim('\0').empty());
std::cout << to_csubstr(m_buffer);
    test_439(to_csubstr(m_buffer));
}
TEST(github, 439_0)
{
    csubstr yaml = ""
        "meta:\n"
        "  title: \"Croft Arena\"\n"
        "  author: \"JaayStation, SOTOS\"\n"
        "  downloadSoundtrack: false\n"
        "  urls:\n"
        "    - \"https://drive.google.com/file/d/1fJ1Z3Ktr2VlPK1MD5OB4tY7uDnqyRMEQ/view?usp=sharing\"\n";
    test_439(yaml);
    test_439_ostream(yaml);
}
TEST(github, 439_1)
{
    csubstr yaml = ""
        "meta:\r\n"
        "  title: \"Croft Arena\"\r\n"
        "  author: \"JaayStation, SOTOS\"\r\n"
        "  downloadSoundtrack: false\r\n"
        "  urls:\r\n"
        "    - \"https://drive.google.com/file/d/1fJ1Z3Ktr2VlPK1MD5OB4tY7uDnqyRMEQ/view?usp=sharing\"\r\n";
    test_439(yaml);
    test_439_ostream(yaml);
}
stohrendorf commented 5 months ago

Hi, I'm sorry for missing some information. First, m_buffer is a std::vector<char>. I have in the meantime updated VS and the project dependencies, so it is not the same environment anymore, which is OFC bad. It does not crash anymore now when the buffer has the correct size and no trailing zeros, so I guess that was an error with a broken build directory. However, when the buffer is technically larger than the contained string, it still crashes. Without the std::ios::binary, the exact error message I get is:

ERROR: could not find ':' colon after key
7:7: 

The buffer at this point looks like this (only the trailing bytes as well as a newline): grafik

I would have expected an assertion somewhere that it's not parsing NUL bytes.

The source YAML, passed through xxd, is

00000000: 6d65 7461 3a0d 0a20 2074 6974 6c65 3a20  meta:..  title: 
00000010: 2243 726f 6674 2041 7265 6e61 220d 0a20  "Croft Arena".. 
00000020: 2061 7574 686f 723a 2022 4a61 6179 5374   author: "JaaySt
00000030: 6174 696f 6e2c 2053 4f54 4f53 220d 0a20  ation, SOTOS".. 
00000040: 2064 6f77 6e6c 6f61 6453 6f75 6e64 7472   downloadSoundtr
00000050: 6163 6b3a 2066 616c 7365 0d0a 2020 7572  ack: false..  ur
00000060: 6c73 3a0d 0a20 2020 202d 2022 6874 7470  ls:..    - "http
00000070: 733a 2f2f 6472 6976 652e 676f 6f67 6c65  s://drive.google
00000080: 2e63 6f6d 2f66 696c 652f 642f 3166 4a31  .com/file/d/1fJ1
00000090: 5a33 4b74 7232 566c 504b 314d 4435 4f42  Z3Ktr2VlPK1MD5OB
000000a0: 3474 5937 7544 6e71 7952 4d45 512f 7669  4tY7uDnqyRMEQ/vi
000000b0: 6577 3f75 7370 3d73 6861 7269 6e67 220d  ew?usp=sharing".
000000c0: 0a                                       .

You can see, as I said earlier, that there's an empty 7th line, as line 6 ends with a line break.

The stacktrace at this point is:

    croftengine.exe!c4::yml::error_impl(const char * msg, unsigned __int64 length, c4::yml::Location loc, void * __formal) Zeile 56 C++
    croftengine.exe!c4::yml::ParseEngine<c4::yml::EventHandlerTree>::_err<>(c4::basic_substring<char const> fmt) Zeile 455  C++
    croftengine.exe!c4::yml::ParseEngine<c4::yml::EventHandlerTree>::_handle_map_block() Zeile 6050 C++
    croftengine.exe!c4::yml::ParseEngine<c4::yml::EventHandlerTree>::parse_in_place_ev(c4::basic_substring<char const> filename, c4::basic_substring<char> src) Zeile 8089  C++
    croftengine.exe!c4::yml::parse_in_place(c4::yml::ParseEngine<c4::yml::EventHandlerTree> * parser, c4::basic_substring<char const> filename, c4::basic_substring<char> yaml, c4::yml::Tree * t, unsigned __int64 node_id) Zeile 40   C++
    croftengine.exe!c4::yml::parse_in_arena(c4::basic_substring<char const> yaml) Zeile 108 C++
>   croftengine.exe!serialization::YAMLDocument<1>::YAMLDocument<1>(const std::filesystem::path & filename) Zeile 68    C++
biojppm commented 4 months ago

Finally got some time to look at this again. Adding the trailing \0 characters does make the error show up.

However, \0 is not a valid character. Specifically,

To ensure readability, YAML streams use only the printable subset of the Unicode character set. The allowed character range explicitly excludes the C0 control block15 x00-x1F

So you have to ensure that the \0 character is not there, for example by doing this:

m_tree = ryml::parse_in_arena(c4::to_csubstr(m_buffer).trimr('\0'));
stohrendorf commented 4 months ago

I'm not complaining that RYML misbehaves when I'm giving it the wrong data which does not conform to the invariants the function requires. What I was expecting was a debug-only assert somewhere that validates that it gets data conforming to these invariants (possibly even guarded by something like RYML_STRICT_DEBUG_ASSERTS or so, so that debug performance doesn't drop unless needed). That would have also avoided this issue here as it would have let me check my own data first and fixing my code without you knowing about it, saving both of our time.

biojppm commented 4 months ago

Although presumably possible, such a feature is very low on the priority list. This is because it addresses only a small corner problem, and yet it is likely not a small task, once you start to account for tests, and all the places that this would have to be looked for.

If you are interested in this, feel free to submit a PR to address the problem. But note that the assertion infrastructure is already in place.