RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.15k stars 555 forks source link

Change some more internal usages of ConjunctiveGraph to Dataset to silence warnings #2867

Closed ashleysommer closed 2 months ago

ashleysommer commented 2 months ago

This was originally a very simple patch to change the use of ConjuctiveGraph to Dataset in the hextuples and nquads graph parsers.

The motivation was to silence the deprecation warnings that are emitted when simply parsing a nquads file or hextuples file, even if you were not personally using a ConjunctiveGraph, the warnings were being emitted for the use of ConjunctiveGraph within the parsers.

One of the key differences between ConjuctiveGraph and Dataset is how the default_graph works. When creating a new ConjunctiveGraph (cg), you can pass an identifier that becomes cg.identifier and also becomes the identifier of the default_graph. It uses a BNode as the identifier for the default graph if you don't pass one. A named graph with BNode as identifier is actually considered an "unnamed graph".

On the other hand, you cannot pass an identifier to a Dataset. The Dataset's internal identifier is always a new BNode, (that isn't even used for anything_. The identifier of ds.default_graph is the constant DATASET_DEFAULT_GRAPH_ID ("urn:x-rdflib:default"). So the difference is ConjunctiveGraph by default has an Unnamed graph as its default_graph, and Dataset has a Named Graph (with a URI) as its default_graph.

Changing ConjunctiveGraph to Dataset in the Hextuples and NQuads parsers inadvertently exposed an underlying existing bug in both parsers. If the destination graph (aka, the "sink") already had a Named default_graph assigned (like in the case of every Dataset), the parser would ignore that and parse into a new default_graph, causing one extra graph in the resulting dataset than you were expecting. That was fixed in this PR.

Another key difference between ConjuctiveGraph and Dataset is that ConjunctiveGraph always sets default_union to False unless you force it True, but Dataset is easily selectable, with the default False. So for compatibility sake, we set default_union to True on the usage of Dataset in these two parsers.

Finally, after those changes were made I started getting failing tests in the seemingly unrelated JSON-LD Serializer. This was caused by the change of nquads to use Dataset in the parser. Specifically, it was an "Unnamed default-graph" vs "Named default-graph" issue. There was an existing bug in the JSON-LD serializer that treated all named-graphs as "non-default" graphs, where it wraps them in a @graph{} wrapper and adds an @id with the named-graph identifier. All "Unnamed" graphs were treated as "default-graph" and were added to the top-level context in the JSON-LD file. The problem is it was treating the Dataset named graph with name DATASET_DEFAULT_GRAPH_ID as a "non-default" graph because it has a URI as its identifier. This was triggering a lot of JSON-LD serializer tests to fail because the expected output was different.

So this PR fixes that issue in the JSON-LD serializer too, so it now treats Dataset default_graph as an un-named graph. Note, this is directly related to https://github.com/RDFLib/rdflib/issues/2445 which still needs to be fixed.

coveralls commented 2 months ago

Coverage Status

coverage: 90.64% (-0.003%) from 90.643% when pulling 509c73c86dd335d54d2a611acddabde76d54588b on internal_conjunctive_graph into 0e7695c4a2add417a5d01df9b5fb566bcc61ed82 on main.

ashleysommer commented 2 months ago

I know the big wall of text in the original comment is hard to follow. The concepts are difficult, and the changes are hard to describe. ConjuctiveGraph vs Dataset usage is resulting in hidden bugs around the use of default_union, default_context, and identifier that are treated differently on Dataset than on ConjunctiveGraph.

Simply changing ConjunctiveGraph to Dataset is exposing these hidden bugs in unexpected areas, thankfully our test suit seems to be catching them. This issue is going to get worse as we approach full removal of ConjunctiveGraph, before it gets better.

The connection between the change in nquads parser, and the failing JSON-LD serializer tests, is because those tests read in the test data from the test suite on disk in nquads format. When the nquads parser changed from parsing using ConjunctiveGraph to parsing using Dataset, it begain using the Dataset's default_graph (DATASET_DEFAULT_GRAPH_ID), as the Named default graph, and that exposed the bug in JSON-LD where it could not handle that (seems like it was simply never tested to serialize a Dataset instance that uses the original default_graph.)