drobilla / serd

A lightweight C library for RDF syntax
https://gitlab.com/drobilla/serd
ISC License

Add robustness against syntax errors? #7

Closed wouterbeek closed 6 years ago

wouterbeek commented 6 years ago

Based on a remark by @kurzum in an hdt-cpp issue, would it make sense to add robustness against syntax errors to Serd?

Whenever a syntax error is identified, Serd could use some heuristics in order to determine where the next triple may start, and try to continue parsing from there.

kurzum commented 6 years ago

This is not possible for complex Turtle. But if it is a one-line-per-triple format with escaped \n, skipping the line and continuing would be nice. Overall, I have now needed several weeks to get HDT working for DBpedia, i.e. merging several bz2 files into one .gz and compiling the develop branch.

Then I had to rezip everything again to let rapper filter out bad triples, which was just an extra unnecessary step.

wouterbeek commented 6 years ago

@kurzum I can feel your pain having to deal with dirty data! Linked Data Quality is an important topic.

WRT parser fallback strategies, the ClioPatria parser collection has fallback strategies for N-Triples and N-Quads, but also for Turtle and RDF/XML (and maybe for RDFa as well, but I'm not sure). This feature is very important in LOD Laundromat, where the vast majority of datasets on the web turn out to be syntactically malformed.

drobilla commented 6 years ago

Hm. I'm not sure. I sympathise, but this strikes me as a never-ending rabbit hole... that said, I'll look into these heuristics. If something relatively simple to implement works well and isn't problematic I will give it a shot.

drobilla commented 6 years ago

There are already some lax parsing facilities in serd and an existing option for that, so statement-level recovery could latch on to that. Worth noting that anything that requires backtracking can never work in serd (which is strictly a streaming parser), though, so a "perfectly" malformed file could probably cause massive chunks of the file to be skipped. Some condensed real test cases would definitely help with this if you could provide any.

kurzum commented 6 years ago

@wouterbeek can point you to the dirty and clean lodlaundromat files.

Skipping lines until a new well-formed statement appears covers 90% of use cases for N-Triples, N-Quads, and maybe Turtle.

Btw, although serd is streaming, keeping a "previous 5 lines" window would enable some backtracking. Not saying that it is worth the effort...
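For illustration, a window over the most recent lines could be kept in a small ring buffer. This is only a sketch and not part of serd: `LineWindow`, `window_init`, `window_push`, `window_get`, and both size constants are hypothetical names chosen for this example, and over-long lines are simply truncated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW_LINES 5   /* hypothetical window size, taken from the comment above */
#define MAX_LINE_LEN 256 /* hypothetical cap; longer lines are truncated */

/* Ring buffer holding the most recently seen lines of a stream. */
typedef struct {
    char   lines[WINDOW_LINES][MAX_LINE_LEN];
    size_t count; /* total number of lines pushed so far */
} LineWindow;

static void window_init(LineWindow *w)
{
    assert(w != NULL);
    w->count = 0;
}

/* Store a copy of `line`, overwriting the oldest entry once the window is full. */
static void window_push(LineWindow *w, const char *line)
{
    assert(w != NULL && line != NULL);
    char *slot = w->lines[w->count % WINDOW_LINES];
    strncpy(slot, line, MAX_LINE_LEN - 1);
    slot[MAX_LINE_LEN - 1] = '\0';
    ++w->count;
}

/* Return the line pushed `back` steps ago (0 = most recent),
   or NULL if that line has already left the window. */
static const char *window_get(const LineWindow *w, size_t back)
{
    assert(w != NULL);
    size_t held = w->count < WINDOW_LINES ? w->count : WINDOW_LINES;
    if (back >= held) {
        return NULL;
    }
    return w->lines[(w->count - 1 - back) % WINDOW_LINES];
}
```

A recovery routine could call `window_get` to re-examine the last few lines after an error. Memory use stays bounded, which keeps the reader streaming, but anything older than the window is gone for good.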

drobilla commented 6 years ago

https://github.com/drobilla/serd/commit/8d954ab071e286f0b2bdfc542bb3725eb5a2ab0e will skip to the next line when lax parsing. This works pretty well for the line-based formats (it gets me through sketchy DBpedia dumps anyway), but will definitely fail horribly in many cases for abbreviated syntaxes. Doing that well will require more work, an actual test suite, and so on.
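The general idea of line-level recovery can be sketched as follows. This is not the code from the commit linked above and does not use serd's API. `toy_line_ok` is a hypothetical, deliberately crude stand-in for a real statement parser, and `RecoveryStats` and `scan_with_line_recovery` are likewise invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a real parser: accepts a line only if it starts
   with '<' or "_:" and its last non-blank character is '.'.
   A real N-Triples parser checks far more than this. */
static bool toy_line_ok(const char *line, size_t len)
{
    while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t' ||
                       line[len - 1] == '\r')) {
        --len; /* trim trailing whitespace */
    }
    if (len < 2 || line[len - 1] != '.') {
        return false;
    }
    return line[0] == '<' || (line[0] == '_' && line[1] == ':');
}

typedef struct {
    size_t accepted; /* lines that passed the toy check */
    size_t skipped;  /* lines dropped by recovery */
} RecoveryStats;

/* Scan `text` line by line. A bad line is counted and dropped, and scanning
   resumes at the start of the next line. No backtracking is needed. */
static RecoveryStats scan_with_line_recovery(const char *text)
{
    RecoveryStats stats = {0, 0};
    assert(text != NULL);
    const char *p = text;
    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        size_t      len = eol ? (size_t)(eol - p) : strlen(p);
        if (len > 0 && p[0] != '#') { /* ignore blank and comment lines */
            if (toy_line_ok(p, len)) {
                ++stats.accepted;
            } else {
                ++stats.skipped;
            }
        }
        p += len;
        if (*p == '\n') {
            ++p; /* step over the newline to the next statement */
        }
    }
    return stats;
}
```

Because every statement in a line-based format ends at a newline, the newline is a reliable resynchronisation point. In Turtle a statement can span many lines, so the same strategy may resume in the middle of a statement.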

wouterbeek commented 6 years ago

Thank you for implementing the skip functionality for lax parsing mode.