
RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.

https://rdflib.readthedocs.org

BSD 3-Clause "New" or "Revised" License


Blank-nodes collisions #980

Closed nleguillarme closed 2 years ago

nleguillarme commented 4 years ago

Hi.

If I correctly understand the graph merging process explained here, the following piece of code should create a graph with two distinct blank nodes:

from rdflib import Graph

graph1 = """
_:0 <http://purl.obolibrary.org/obo/RO_0002350> <http://www.gbif.org/species/0000001> .
"""
graph2 = """
_:0 <http://purl.obolibrary.org/obo/RO_0002350> <http://www.gbif.org/species/0000002> .
"""

g = Graph()
g.parse(data=graph1, format="nt")
g.parse(data=graph2, format="nt")

for triple in g:
    print(triple)

However, when I execute the code, I get the following output:

(rdflib.term.BNode('Ne3fd8261b37741fca22d502483d88964'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002350'), rdflib.term.URIRef('http://www.gbif.org/species/0000002'))
(rdflib.term.BNode('Ne3fd8261b37741fca22d502483d88964'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002350'), rdflib.term.URIRef('http://www.gbif.org/species/0000001'))

Am I missing something? (Versions: rdflib 4.2.2, Python 3.7.5)

white-gecko commented 4 years ago

I think you understand it correctly. I think this is related to issue #892. RDFLib uses the blank node identifiers as they are.

Changing this behavior now would break some things, and as we are in the feature freeze for 5.x, I moved it to the 6.0.0 milestone.

Actually I have a use case where I need to parse multiple files within the same context of blank identifiers. When executing SPARQL queries I need to have individual contexts per query. Maybe it would be a good idea to introduce some blank context object which can be handed over to the parse method and the query method. We have to put this on the roadmap for 6.0.0.

nleguillarme commented 4 years ago

Thank you for your reply. However, I don't really understand. Does that mean that there is no graph merging mechanism currently implemented in rdflib? This would contradict what the doc says:

In RDFLib, blank nodes are given unique IDs when parsing, so graph merging can be done by simply reading several files into the same graph

https://rdflib.readthedocs.io/en/stable/merging.html

sanyam19106 commented 4 years ago

But both graphs have the same subject and predicate; only the object is different.

vikash18086 commented 4 years ago

We solved this issue as follows: we keep a new map through which we assign new IDs to the blank nodes of each graph. If two blank nodes with the same label come from the same graph, we assign them the same ID. You can download the updated code from #1101.
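
The per-graph map described above can be sketched in plain Python. This is only an illustration of the idea, not the code from #1101 and not rdflib's API: triples are modelled as tuples of strings, blank nodes are recognised by their `_:` prefix, and the names `relabel` and `merge_graphs` are hypothetical.

```python
import uuid


def relabel(term, mapping):
    """Return a fresh blank node label for `term`, reusing `mapping` within one graph."""
    if not term.startswith("_:"):
        return term  # IRIs (and anything else) pass through unchanged
    if term not in mapping:
        mapping[term] = "_:N" + uuid.uuid4().hex
    return mapping[term]


def merge_graphs(graphs):
    """Merge several lists of (s, p, o) string triples, keeping blank nodes apart."""
    merged = set()
    for triples in graphs:
        mapping = {}  # a fresh label map for every source graph
        for s, p, o in triples:
            merged.add((relabel(s, mapping), p, relabel(o, mapping)))
    return merged


# The two documents from the original report, as string tuples.
graph1 = [("_:0", "<http://purl.obolibrary.org/obo/RO_0002350>",
           "<http://www.gbif.org/species/0000001>")]
graph2 = [("_:0", "<http://purl.obolibrary.org/obo/RO_0002350>",
           "<http://www.gbif.org/species/0000002>")]

merged = merge_graphs([graph1, graph2])
subjects = {s for s, _, _ in merged}
```

With this sketch, `merged` holds two triples whose subjects are two different blank node labels, which is the result the original report expected.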

white-gecko commented 4 years ago

@vikash18086 thank you for contributing to RDFLib. I think this would not actually solve the issue. As I have mentioned earlier:

Actually I have a use case where I need to parse multiple files within the same context of blank identifiers. When executing SPARQL queries I need to have individual contexts per query. Maybe it would be a good idea to introduce some blank context object which can be handed over to the parse method and the query method. We have to put this on the roadmap for 6.0.0.

So we need some way to:

reference blank nodes across graphs within a dataset/conjunctive graph
parse multiple documents within the same context of blank nodes
parse files in different contexts
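
A rough sketch of what such a blank node context could look like, using a toy N-Triples-like parser written in plain Python. Everything here is an assumption made for illustration: `parse_nt` and its `bnode_context` argument are hypothetical names, the parser only handles whitespace-separated terms (no literals containing spaces), and it is not rdflib's implementation.

```python
import re
import uuid

# Three whitespace-separated terms followed by a terminating dot.
TRIPLE = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\S+)\s*\.\s*$")


def parse_nt(data, bnode_context=None):
    """Parse toy N-Triples text into a list of (s, p, o) string tuples.

    `bnode_context` maps document labels (e.g. "_:0") to generated labels.
    Passing the same dict to several calls shares blank nodes between them;
    omitting it gives every call its own private context.
    """
    if bnode_context is None:
        bnode_context = {}
    triples = []
    for line in data.splitlines():
        match = TRIPLE.match(line)
        if match is None:
            continue  # skip blank or unsupported lines
        terms = []
        for term in match.groups():
            if term.startswith("_:"):
                term = bnode_context.setdefault(term, "_:N" + uuid.uuid4().hex)
            terms.append(term)
        triples.append(tuple(terms))
    return triples


doc1 = "_:0 <http://purl.obolibrary.org/obo/RO_0002350> <http://www.gbif.org/species/0000001> ."
doc2 = "_:0 <http://purl.obolibrary.org/obo/RO_0002350> <http://www.gbif.org/species/0000002> ."

# Different contexts: each document gets its own blank node.
separate = parse_nt(doc1) + parse_nt(doc2)

# Same context: both documents refer to one blank node.
shared_context = {}
shared = parse_nt(doc1, shared_context) + parse_nt(doc2, shared_context)
```

Making the context an explicit object that the caller owns covers both of the parsing requirements above: the caller decides whether two documents share it.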

white-gecko commented 4 years ago

Cool, thank you @mwatts15 for #1107. This is the interface as I proposed it, and I like it. We have to make sure that it also works across different serialization formats. I think it should not be a problem with Turtle. For RDF/XML, the value of rdf:nodeID is the same as the bnodeLabel following _: in Turtle and N-Triples, and JSON-LD also uses the _: syntax.

Also, we need a similar solution for #892.

white-gecko commented 4 years ago

I'm currently not able to test #1107 and #1108. But as far as I can see, the tests in #1108 do not yet cover using the same context for different serialization formats. Also, we need it for the other formats as well.

mwatts15 commented 4 years ago

@white-gecko I'm only really interested in the N-Triples and N-Quads formats.

As far as other parsers, you already get distinct blank nodes between different documents for some. I don't know if sharing them across documents makes as much sense for other formats. Turtle/N3 has more complicated handling of blank nodes: formulas define their own nested blank node contexts. What's the use-case for something like the bnode_context idea? The RDF/XML parser gives you distinct IDs for each parse unless you use preserve_node_ids - it just means "use the node ID as the BNode identifier". TriX also has preserve_node_ids although the TriX parser still creates BNodes like BNode(label) even when it's not "preserving" identifiers -- seems pretty useless.

JSON-LD looks like it would be more annoying in general, but also for this. I have less than zero interest in that.

white-gecko commented 4 years ago

That is fine. I'm actually also just interested in this feature for NTriples. But for the sake of consistency of the parsing interface I think it would be good to have the blank node/blank id support handled in the same way for all parsers. Maybe there will be somebody who needs it at some time … ;-)

ghost commented 2 years ago

Looks like #1108 fixes this issue (“Address remainder #980. Also add similar behavior for N-Quads.”) and so it can be closed?

nicholascar commented 2 years ago

Closing this Issue since PR #1495 includes a test that shows that this particular Issue is solved (due to PR #1108). Thanks @gjhiggins!