RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

BUG : jsonld parser does not randomise bnodes when using local _prefixed identifiers `"@id": "_:mybnode01"` #2760

Open marc-portier opened 6 months ago

marc-portier commented 6 months ago

Parsing this turtle:

_:b0 a <http://example.org/MyType> .
_:b1 a <http://example.org/MyType> .
_:x9 a <http://example.org/MyType> .

leads to the local bnode identifiers being (correctly) replaced with generated, uuid-based ones:

. found 0 -> BNode item.n3()='_:n17eb88c2c4cf4557b23b9407db5723ffb1' 
. found 1 -> BNode item.n3()='_:n17eb88c2c4cf4557b23b9407db5723ffb2' 
. found 2 -> BNode item.n3()='_:n17eb88c2c4cf4557b23b9407db5723ffb3'

and (also correctly) to new ones on every run.

While parsing the equivalent json-ld:

[
  {"@id": "_:b0", "@type": "http://example.org/MyType" },
  {"@id": "_:b1", "@type": "http://example.org/MyType" },
  {"@id": "_:x9", "@type": "http://example.org/MyType" } ]

will lead to the original labels being kept as-is:

. found 0 -> BNode item.n3()='_:b0' 
. found 1 -> BNode item.n3()='_:b1' 
. found 2 -> BNode item.n3()='_:x9' 

This extends the reach and lifetime of these local identifiers far beyond their intended scope.

In practice: loading two distinct json-ld files which happen to use the same local bnode-identifiers into the same graph will effectively mix up the nodes from both.

Note: A similar issue was identified and fixed in rdflib.js --> https://github.com/linkeddata/rdflib.js/issues/555

marc-portier commented 6 months ago

In case you get bitten by this bug too: a dirty hack around it is to serialize the graph to another format and parse it back with another rdflib parser that does not have this problem.

from rdflib import Graph


def reparse(g: Graph, format: str = "nt") -> Graph:
    """Dirty workaround for https://github.com/RDFLib/rdflib/issues/2760.

    Reproduces the graph by serializing it and parsing it again,
    via an intermediate format (not jsonld!) that is known to work.

    :param g: the graph to reparse
    :param format: the intermediate format to use
    :return: a new Graph built from the reparsed data
    """
    return Graph().parse(data=g.serialize(format=format), format=format)
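Another possible angle, sketched here with the standard library only: make the labels unique per document before the text ever reaches the json-ld parser. The names `relabel_bnodes` and `uniquify_jsonld` are made up for this example and are not part of rdflib. The sketch assumes that every string value starting with `_:` (outside `@value`) is a bnode label, which would also catch a plain string literal that happens to start with `_:`, and it does not touch bnode labels used as keys.

```python
import json
import uuid
from typing import Any


def relabel_bnodes(doc: Any, suffix: str) -> Any:
    """Return a copy of parsed JSON-LD with `suffix` appended to "_:" strings.

    Hypothetical helper: strings under "@value" are literals and left alone.
    """
    if isinstance(doc, dict):
        return {
            key: value if key == "@value" else relabel_bnodes(value, suffix)
            for key, value in doc.items()
        }
    if isinstance(doc, list):
        return [relabel_bnodes(item, suffix) for item in doc]
    if isinstance(doc, str) and doc.startswith("_:"):
        # Same label + same suffix keeps references inside one document intact.
        return f"{doc}_{suffix}"
    return doc


def uniquify_jsonld(text: str) -> str:
    """Rewrite the bnode labels of one JSON-LD document with a fresh suffix."""
    suffix = uuid.uuid4().hex  # new suffix for every document
    return json.dumps(relabel_bnodes(json.loads(text), suffix))
```

The result of `uniquify_jsonld(text)` can then be passed on as `Graph().parse(data=..., format="json-ld")`. Unlike `reparse`, this avoids the extra serialize and parse round trip, at the cost of the assumptions listed above.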