SWI-Prolog / packages-semweb

The SWI-Prolog RDF store

Auto-generated blank node labels are syntactically invalid #68

Open wouterbeek opened 6 years ago

wouterbeek commented 6 years ago

When a dataset that contains blank nodes is processed in the Semantic Web standard libraries, blank nodes are assigned auto-generated labels. The URI representation of the file path from which the data is loaded forms part of these generated labels (see the example below). Unfortunately, forward slashes are not allowed in Turtle-family blank node label syntax. This means that Prolog blank node labels cannot be directly emitted in the process of generating a Turtle-family export or a SPARQL result set.

?- [library(semweb/rdf11)].
?- [library(semweb/turtle)].
?- rdf_load('vocab.trig', [format(trig)]).
?- rdf(S, P, O).
S = '_:file:///home/wbeek/git/Triply/cshapes/vocab.trig1',
P = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#first',
O = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' ;

The predicates that perform batch exports, i.e., that export complete files at once, do turn these internal blank node labels into standards-compliant serialization output.

The problem remains for applications that stream through the data. Specifically, it is currently not possible to 'recode' an RDF dataset from one format into another using a statement-wide window. Renaming internal blank node labels to standards-compliant external blank node labels requires an in-memory mapping (turtle.c uses a hash map for this), which can grow arbitrarily large for arbitrarily long data streams.
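To make the memory concern concrete, here is a minimal sketch of such a renaming map in Prolog. It is not how turtle.c does it; `rename_bnode/4` and the `Counter-Assoc` state are hypothetical names used only for illustration. Every distinct internal label adds one entry to the map, so the state grows with the number of blank nodes seen in the stream.

```prolog
:- use_module(library(assoc)).

%!  rename_bnode(+Internal, -External, +State0, -State)
%
%   Hypothetical helper: State is a pair Counter-Assoc, where Assoc
%   maps internal labels to short external labels such as '_:b1'.
rename_bnode(Internal, External, N-Map, N-Map) :-
    get_assoc(Internal, Map, External),     % label seen before: reuse it
    !.
rename_bnode(Internal, External, N0-Map0, N-Map) :-
    N is N0 + 1,                            % new label: allocate the next number
    atom_concat('_:b', N, External),
    put_assoc(Internal, Map0, External, Map).
```

A stream writer would thread the state through every statement, starting from `0-Map` with `empty_assoc(Map)`, and could only drop it once the whole stream has been written.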

JanWielemaker commented 6 years ago

Well, the Turtle writer will rename Prolog's blank nodes into nice and short ones. I'm not a big fan of hashes as they make debugging hard. I see various options: make sure they never leak through standard protocols; use an encoding that can be reverted (e.g., the URL-friendly base64 variant), so we can use portray or other tools to make them readable again; or use a hash.
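A rough sketch of the reversible-encoding option, assuming `base64url/2` from `library(base64)` is available and that the label text only contains characters the encoder accepts. `bnode_external/2` and the `b` prefix are made up for this example; the prefix is there because the base64url alphabet includes `-`, which should not start a blank node label.

```prolog
:- use_module(library(base64)).

%!  bnode_external(?Internal, ?External)
%
%   Hypothetical helper: relate an internal label such as
%   '_:file:///...' to a slash-free external label '_:b<base64url>'.
%   Works in both directions, so no mapping table has to be kept.
bnode_external(Internal, External) :-
    nonvar(Internal),
    !,
    atom_concat('_:', Local, Internal),     % strip the '_:' prefix
    base64url(Local, Encoded),              % encode the rest
    atom_concat('_:b', Encoded, External).
bnode_external(Internal, External) :-
    atom_concat('_:b', Encoded, External),  % strip '_:b'
    base64url(Local, Encoded),              % decode back to the original text
    atom_concat('_:', Local, Internal).
```

Because the mapping is stateless, a streaming writer could apply it per statement, and a portray hook could call it in the decoding direction to show the readable label while debugging.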

wouterbeek commented 6 years ago

Indeed, the writers fix this. I was writing atoms (bnodes and IRIs) to N-Triples directly, but that is a recipe for disaster :) Thanks for pointing that out.

wouterbeek commented 6 years ago

I've updated the issue to make it clearer that blank node renaming is currently not implemented for streamed writers.