biojppm / rapidyaml

Rapid YAML - a library to parse and emit YAML, and do it fast.
MIT License
572 stars 97 forks source link

Fails to parse plain flow (unquoted) scalars #60

Closed piorecki-simon closed 4 years ago

piorecki-simon commented 4 years ago

Tried to parse my file, but throws:

ERROR parsing yml: parse error - indentation should not increase at this point line 996

I am not sure if the file respects the yaml specification. I tried yaml-cpp too and it did work. Yaml-cpp took 52 seconds with 8% cpu usage with i9 9900k.

You can download / check:

https://developers.eveonline.com/resource/resources

Download the file "sde-TRANQUILITY.zip" or use this link

https://eve-static-data-export.s3-eu-west-1.amazonaws.com/tranquility/sde.zip

try to parse sde/fsd/typeIDs.yaml

biojppm commented 4 years ago

Thanks for reporting. Indeed this is likely a problem with ryml. Let me investigate; but please note that it may take me some days to get the time to investigate and (possibly) fix this.

biojppm commented 4 years ago

This file has a problem: there are many entries (notably the japanese entries) with plain flow (unquoted) scalars containing non legal sequences, eg :, :. For example, if you open https://yaml-multiline.info/, at the end of the page you will see:

Note: Plain flow scalars are picky about the : and # characters. They can be in the string, but : cannot appear before a space or newline, and # cannot appear after a space or newline; doing this will cause a syntax error. If you need to use these characters you are probably better off using one of the quoted styles instead.

There are also some cases with - at the beginning of the line, which is problematic for the parser, but I think is a legal situation which currently ryml does not handle.

Note that although the file is illegal YAML, some parsers and linters can deal with these cases. I'm going to (within reason) make an effort to accommodate all or most of these nonlegal cases. But in the meantime if you make sure all scalars are quoted you should be able to get ryml to parse the file. Is the file yours, and is that something you can do?

On a related note, parsing of unquoted scalars is slower, as it must be done in a look-ahead fashion, so if you are able to have quotes everywhere you will also profit from a faster parse time.

piorecki-simon commented 4 years ago

The file is not mine. It is a resource file for developers from a well known game which lists almost all items and its related static data, e.g. description in several languages. Therefore I thought that the YAML will meet the specification.

I am not a professional and I do not know much about YAML specifications. The only thing I could do, is to set quotes manually. However, if a new version of the file is released, I would have to set those quotes manually again.

Another solution which comes to my mind is a conversion from YAML to JSON but I am not sure if this will be better in the end.

So, do you know a handy way to deal with those kind of files. I just want to get the information ;) .

biojppm commented 4 years ago

The file is not mine. It is a resource file for developers from a well known game which lists almost all items and its related static data, e.g. description in several languages.

Yes, I figured so.

Another solution which comes to my mind is a conversion from YAML to JSON but I am not sure if this will be better in the end.

It still has to be parsed from YAML.

So, do you know a handy way to deal with those kind of files.

As I conveyed above, I'm adjusting ryml so that it handles these cases, which despite being non-conformant are common enough to be (within reason) worth the effort. Hopefully that will be ready in the next couple of days.

piorecki-simon commented 4 years ago

Thank you very much for your effort. Although this file may not meet the YAML specifications correctly, I hope it is a good test example for your library ~

biojppm commented 4 years ago

The current state of PR #61 is finally parsing this file without any errors or triggering of assertions. Note I've not done any correctness checks yet, so something may still be off. And it's not ready yet; I have to improve coverage and do some final cleanup.

Nevertheless, the results are very encouraging: in my laptop with a i7-8650 @ 1.90GHz, parsing that file takes 0.55s in Release and 17s in Debug mode. So apparently a 100x speedup (?)... Seems too good to be true. I'm curious - was the 55s time you reported for yaml-cpp in Debug or Release?

Anyway, here's the breakdown of the timings, taken with a ryml test utility:

# Release mode
$ linux-x86_64-clangxx10.0-Debug/test/ryml-parse-emit ~/proj/large_yaml_files/typeIDs.yaml > /dev/null
44ms: read_file
51ms: count_lines
reserving #lines=1348510
1.3e+02ms: tree_reserve
5.5e+02ms: parse_yml
3.6e+02ms: emit_to_buffer
0.0079ms: print_stdout
1.1e+03ms: objects
1.1e+03ms: TOTAL

# Debug mode
$ linux-x86_64-clangxx10.0-Debug/test/ryml-parse-emit ~/proj/large_yaml_files/typeIDs.yaml > /dev/null
45ms: read_file
7.8e+02ms: count_lines
reserving #lines=1348510
8.7e+02ms: tree_reserve
1.7e+04ms: parse_yml
6.9e+03ms: emit_to_buffer
0.011ms: print_stdout
2.5e+04ms: objects
2.5e+04ms: TOTAL
biojppm commented 4 years ago

@Merkaber It is finally merged to master. The result seems to be correct, but the file is large so please confirm (and also the time you reported - is it correct?). And feel free to reopen if there's still anything to fix.

piorecki-simon commented 4 years ago

@biojppm The file parsed without any errors. However, I did not check the actual data yet.

It was my own mistake that I used debug mode all the time. I am not confident with cmake and testing and don't know how to use your test files. Therefore I just did the following:

std::ifstream in(path); std::string contents((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>()); std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now(); ryml::Tree tree = ryml::parse(c4::to_substr(&contents[0])); std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

Release rapidyaml: 645 milliseconds. For yaml-cpp with the same approach: 6186 milliseconds

Debug rapidyaml: 17913 milliseconds yaml-cpp: 52104 milliseconds

Tested on Windows 10 Home with i9-9900k @ 4.00GHz

Your library is very fast. Thank you very much. Keep it up!