RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.17k stars 556 forks source link

Issues with `rdflib.Dataset` #2021

Open NeilGraham opened 2 years ago

NeilGraham commented 2 years ago

At the moment there are a few key issues with the current rdflib.Dataset, particularly in its handling of parsing Default Graph triples. I am very interested in working on a Pull Request to solve these issues since it will improve Named Graph support for all projects making use of rdflib.

Key issues with rdflib.Dataset:


By adding a new term rdflib.term.DefaultGraph, we can handle the serialization of these triples independent from Named Graph triples. Additionally, we should be able to access the default graph by specifying the following:

from rdflib import Dataset
from rdflib.term import DefaultGraph

ds = Dataset()
trig_string = (
    '<dg:s> <dg:p> <dg:o>. '
    'GRAPH <ng:g> { <ng:s> <ng:p> <ng:o>. }'
)
ds.parse(data=trig_string, format="trig")

default_graph = ds.graph(DefaultGraph)
for triple in default_graph:
    print(triple)
(rdflib.term.URIRef('dg:s'), rdflib.term.URIRef('dg:p'), rdflib.term.URIRef('dg:o'))
ghost commented 2 years ago

Indeed, it has been a long-standing issue, however, there is ongoing work which (happily) addresses all the issues you raise.

Although the PR is marked “closed”, that doesn't accurately reflect the reality. I had to delete and recreate my GH fork which also closed the PR and, as rebasing is preferred over merging, I've left the situation as is.

The changes are quite profoundly API-breaking from a usage perspective but the underlying semantics remain the same: Dataset.contexts() is changed to return Graph identifiers and Dataset.graphs() is introduced to return Graph objects. ConjunctiveGraph disappears, being completely replaced by Dataset - the core ConjunctiveGraph behaviour being preserved with Dataset(default_union=True).

For this reason, the Dataset re-work has to be part of a major version release (7.0, currently scheduled for October). When the core library is seeing fewer large-scale changes, I'll re-release the work as a draft PR.

aucampia commented 2 years ago

I do think at least some of these things can be fixed without breaking interfaces, and as discussed before, I think we should try and publish an ADR before we commit to new interface designs.

My current priorities are mainly related to improving RDFLib quality to get it to a level where I feel comfortable with using it in a production setting, and this includes: Adding type hints, fixing bugs, expanding test suite compliance. There are many issues, and it is always easier to some extent to start with a clean slate, but I feel we do have some obligation to users to manage our API with judiciously, because if we don't then we will essentially be abandoning semantic versioning. These interfaces came into RDFLib, and now we own them, and we have to make the best of it, and figure out a way to make things better while not shifting the burden to RDFLib users. I think what this really should indicate is how important it is to that our public API should be designed and evolved with great care.

As I indicated elsewhere, what I would like to see happen for interface re-architecthing:

  1. An ADR being published and accepted. This is important because changing our interface places a burden on our users, and we should do it as infrequently as possible, with the ideal being that we never do it.
  2. The ADR should consider all problems that need to be addressed with the old interface. For example, we have many cases where we have functions that break PEP-8 naming, however, these functions could have some other problems with it also, like LSP violations, and in that case we should not just fix the name of the function without also addressing other issues with the function, like LSP violations, otherwise we will just have to change the function again, and break our public interface again.
  3. Redesigned/New interfaces should be published as experimental under a private package (e.g. rdflib._experimental) so people can test it out before we commit to it.
  4. Notice should be given at least one release before we actually remove old interfaces.
ghost commented 2 years ago

I failed to find a position where ConjunctiveGraph and Dataset could co-exist without significant architectural disruption - but that's not really saying anything. The LSP violations are profound: ConjunctiveGraph subclasses Graph and Dataset subclasses ConjunctiveGraph - however, it was never intended to be treated as an "is_a" inheritance relationship but as a mixin.

ghost commented 2 years ago

Just to recap ... as regards the now-closed PR #1814, it was a follow-on from my initial exploratory rebase (#1646) of gromgull's "identifier-as-context" PR #958. The rebase revealed that more work was required in order to achieve a clean implementation, which I explored and managed to get (what I considered to be) the basis of a satisfactorily cleaner approach. However, that work has been largely obviated by the subsequent identification of significant Liskov Substitution Principle violations in the Graph/ConjunctiveGraph/Dataset/Store architecture which, if allowed to persist, would significantly compromise the advantages of static type-checking which is clearly already adding to the robustness and reliability of the library. However, these violations are also fundamental features of the published API and are not trivially addressable.