drobilla / serd

A lightweight C library for RDF syntax
https://gitlab.com/drobilla/serd
ISC License

Colliding generated blank nodes during TriG import #19

Closed (Toni100 closed 5 years ago)

Toni100 commented 6 years ago

The following TriG import creates a blank node b1 which collides with the already existing one:

> serdi -i trig -s "_:b1 <http://example.org/p> [] ."
_:b1 <http://example.org/p> _:b1 .

The Turtle importer correctly avoids collision:

> serdi -i turtle -s "_:b1 <http://example.org/p> [] ."
_:B1 <http://example.org/p> _:b1 .

Tested with the branches master and serd1.

drobilla commented 5 years ago

Hrm, unfortunately the TriG test suite has test cases that break if this is fixed. I guess I'll have to change the test suite even more :/ (the W3C tests are hostile to streaming implementations in general, so there are other minor statement-order and blank node ID fixes in the serd versions).

drobilla commented 5 years ago

Fixed in a71575b

Toni100 commented 5 years ago

Hi David, thanks for the fix.

I think in general the idea of "generating blank node labels" is incompatible with constant-memory streaming: to generate non-colliding labels in every situation you have to (in the worst case) keep track of all encountered and generated labels (at least I have not come up with a way around it). So my example above was more of a hint; one could easily construct more examples that continue to fail. I'm mostly concerned about silently importing a file incorrectly, but this could be avoided. To detect a possible collision in a low-memory way, one can check whether:

1) any blank node of the form b1, b2, ... is encountered, and
2) a blank node has been generated.

If both hold, abort.

Then there could be a "correct" mode that keeps track of blank nodes and handles all cases, for use cases where a file needs to be imported no matter how much memory is used.

drobilla commented 5 years ago

That is how serd works: if a clash would happen (both b and B prefixes encountered) it will error out.
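The constant-memory check discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not serd's actual code: the `ClashDetector` type and the function names are hypothetical, and it assumes generated labels have the form `b` followed by digits. It follows the two-condition check proposed above (a document label that looks generated, plus at least one generated label), so it is conservative and may abort in cases where no real collision would occur.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical detector state: two flags, so memory use is constant. */
typedef struct {
  bool seen_generated_form; /* The document used a label like "b1" */
  bool has_generated;       /* A label was generated, e.g. for [] */
} ClashDetector;

/* Return true if label is 'b' followed by one or more digits
   (the assumed form of generated labels). */
static bool
looks_generated(const char* label)
{
  assert(label != NULL);
  if (label[0] != 'b' || label[1] == '\0') {
    return false;
  }
  for (size_t i = 1; label[i] != '\0'; ++i) {
    if (!isdigit((unsigned char)label[i])) {
      return false;
    }
  }
  return true;
}

/* Return true if both conditions hold, i.e. a clash is possible. */
static bool
clash_possible(const ClashDetector* d)
{
  return d->seen_generated_form && d->has_generated;
}

/* Record a blank node label read from the document.
   Returns false if the import should abort. */
static bool
on_document_label(ClashDetector* d, const char* label)
{
  if (looks_generated(label)) {
    d->seen_generated_form = true;
  }
  return !clash_possible(d);
}

/* Record that a fresh label was generated.
   Returns false if the import should abort. */
static bool
on_generated_label(ClashDetector* d)
{
  d->has_generated = true;
  return !clash_possible(d);
}
```

With the input from the original report, `on_document_label(&d, "b1")` succeeds and the following `on_generated_label(&d)` for `[]` reports that the import should abort, instead of silently emitting a colliding `_:b1`.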

I won't add a ton of specific code to maintain a data structure just for this. However in the next major version, serd will include a model, so normalisation of this would be possible if desired, assuming you have the memory to store the entire model.

The middle ground (you don't have enough memory for the whole model but maybe do just for some blank node index) strikes me as not worth the effort. It's easy enough to work around in other ways that don't bloat serd too much for an edge case.

Toni100 commented 5 years ago

Thanks for the explanation and the outlook on version 1. Looking forward to testing it.

drobilla commented 5 years ago

y/w. This ultra-simple scheme has been Good Enough™ for quite some time now, but good catch on the TriG parser bug.

Note that you can be craftier if you need to be with the -c and -p options (or corresponding bits of the API). This is what I do in programs that load a bunch of files into the same place so that blank IDs don't clash, but are (partially) preserved (as a suffix) and recoverable if necessary.
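The idea behind adding and chopping a blank node prefix can be illustrated with a small standalone sketch. The helpers `add_blank_prefix` and `chop_blank_prefix` below are hypothetical names written for this example and are not serd's API; they only show how a per-file prefix keeps labels from different files apart while leaving the original ID recoverable as a suffix.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write prefix followed by id into out.
   Returns 0 on success, or -1 if out is too small. */
static int
add_blank_prefix(const char* prefix, const char* id, char* out, size_t out_size)
{
  assert(prefix != NULL && id != NULL && out != NULL);
  const int n = snprintf(out, out_size, "%s%s", prefix, id);
  return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Hypothetical helper: if id starts with prefix, return the part after it
   (the original ID), otherwise return id unchanged. */
static const char*
chop_blank_prefix(const char* prefix, const char* id)
{
  assert(prefix != NULL && id != NULL);
  const size_t len = strlen(prefix);
  return strncmp(id, prefix, len) == 0 ? id + len : id;
}
```

For example, giving two files the assumed prefixes `f1_` and `f2_` turns a `b1` in each into the distinct labels `f1_b1` and `f2_b1`, and chopping `f1_` from the first yields `b1` again.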