RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.15k stars 555 forks source link

Graph.__eq__ checks for graph identity, not for equivalent sets of triples #2105

Open tombaker opened 2 years ago

tombaker commented 2 years ago

Problem: In Pythonic style, I would expect that given graphs g and h, g == h would return True if each graph held the same set of triples. However, Graph.__eq__ actually checks whether the graphs have the same identity (ie, redundantly with g is h).

Example:

>>> g = Graph()
>>> g.add((URIRef("http://example.org/1"), RDF.type, SKOS.Concept))
<Graph identifier=N3bce35ed293b4c20956c2371431aeb2d (<class 'rdflib.graph.Graph'>)>
>>> h = Graph()
>>> for triple in g:
...     h.add(triple)
>>> set(g) == set(h)
True
>>> g == h
False

Might it be more useful for __eq__ to return True if the two graphs held the same set of triples?

ghost commented 2 years ago

Problem: In Pythonic style, I would expect that given graphs g and h, g == h would return True if each graph held the same set of triples. However, Graph.__eq__ actually checks whether the graphs have the same identity (ie, redundantly with g is h).

Might it be more useful for __eq__ to return True if the two graphs held the same set of triples?

The apparently trivial identity check is just one end of a scale and, in context, it may be more Pythonic than one might think. It follows the standard “id equality” approach of the default __eq__ operation (as usefully described on SO).

But the documentation could indeed do a better job of being explicit about the id equality semantics of Graph.__eq__() and g == h.

Going up the scale (of effort and semantic complexity), there is Graph.isomorphic(), the docstring for which is less terse and informs that the method performs a “basic check” of the equality of the set of triples in each graph --- but is accurate only if BNodes are not involved.

Then, even higher up the scale, there is rdflib.compare.isomorphic() (see docs) which is accurate for graphs containing BNodes but does not handle graphs containing N3 formula (the latter implemented as QuotedGraph in RDFLib).

For a technical insight into a current model of the issues, see Aidan Hogan's paper on RDF Canonicalization

With respect to Model Theory, the notion of "equivalence" is perhaps more useful because as, Hermann ter Horst observes in the W3-rdf list archive:

The RDF Concepts and Abstract Syntax document defines RDF graphs to be sets of triples. In this way, equality of RDF graphs is also defined.

Namely, by one of the first axioms of set theory, two sets are equal if and only if they have the same elements. Therefore, the definition of RDF graph equality given in the Concepts document would introduce contradictions: two RDF graphs which do not have exactly the same triples may be equal and not equal at the same time.

In terms of “exactly the same triples”, consider the “equality” of the triples:

[] foaf:givenName "Ramón" .

and

[] foaf:givenName "Ramón"@es .

Footnote

Thus far this is only considering assertional datasets, things get even more complicated when considering intensional statements. F'r instance, given this combination of assertional and intensional statements ...

@prefix log: <http://www.w3.org/2000/10/swap/log#> .
@prefix : <urn:example#> .

:Socrates a :Man.
:ChuckNorris :lives :MtOlympus .

@forAll :x .
    { :x a :Man } log:implies { :x a :Mortal } .
    { :x :lives :MtOlympus } log:implies { :x a :god } .
    { :x a :god } log:implies { :x :lives :MtOlympus } .
    { :x a :Man } log:implies { :x :drinks :whisky } .

the materialized assertional version, expressing the same information would be:

@prefix : <urn:example#> .

:Socrates a :Man.
:ChuckNorris :lives :MtOlympus .

:ChuckNorris a :god .
:Socrates a :Mortal ;
:drinks :whisky .

So, to a degree, the definition of “g == h” is a matter of context

HTH

tombaker commented 2 years ago

@gjhiggens

Thank you for the explanation and very useful links!

I certainly take your point that RDF semantics complicate the notion of equality between graphs, though my specific concerns are limited to how graphs are handled when expressed as rdflib.Graph instances and not to how those underlying graphs might be interpreted, canonicalized, or expanded with materialized triples, nor to any additional complexities related to other classes, such as QuotedGraph.

The apparently trivial identity check is just one end of a scale and, in context, it may be more Pythonic than one might think. It follows the standard “id equality” approach of the default eq operation (as usefully described on SO).

In my reading of the SO description, __eq__ is called wherever it has been implemented, and if it has not been implemented, only then are objects compared for identity. (That said, the phrase "else, finally" in point 4 reminds me of the inconsistent uses of else in Python if, for, try, and while statements, so it is possible that I am missing the point.) At any rate, my understanding seems consistent with how equality comparisons are handled for things like lists and, more to the point, for sets:

>>> p = {1, 2, 3}
>>> q = {1, 2, 3}
>>> p == q
True
>>> id(p) == id(q)
False

RDF 1.1 Concepts says that an RDF graph is a set of RDF triples, and RDFLib "aims to be a pythonic RDF API" where "RDFLib graphs are un-sorted containers; they have ordinary Python set operations (e.g. add() to add a triple) plus methods that search triples and return them in arbitrary order."

As per the Principle of Least Astonishment, in my reading, Graph instances should behave like sets, as they do in my example above, where set(g) == set(h) evaluates as True.

In terms of "exactly the same triples", consider the "equality" of the triples: [] foaf:givenName "Ramón" . and [] foaf:givenName "Ramón"@es .

Python documentation says that "Two sets are equal if and only if every element of each set is contained in the other (each is a subset of the other)", though I would not expect the two triples above to be considered equal for the purposes of a pythonic equality check. Indeed, as I would expect:

>>> bob = URIRef("http://example.org/people/Bob")
>>> linda = BNode()
>>> g = Graph()
>>> g.add((bob, FOAF.knows, linda))
>>> g.add((linda, FOAF.name, Literal("Linda")))
>>> k = Graph()
>>> k.add((bob, FOAF.knows, linda))
>>> k.add((linda, FOAF.name, Literal("Linda", lang="en")))
>>> set(g) == set(k)
False

I would, however, consider two graphs to evaluate as equal, even with BNodes, as long as those BNodes were identified with exactly the same identifiers:

>>> h = Graph()
>>> for triple in g:
...     h.add(triple)
>>> set(g) == set(h)
True
>>> id(g) == id(h)
False

To me, basing Graph comparison on set comparison would follow the Principle of Least Astonishment. But if that ship has sailed, I agree it would be useful for RDFLib documentation to have a sentence or two explaining that graph comparison is based on identity, and why.

ghost commented 2 years ago

RDFLib 0.9.4 was released a year earlier than the W3-rdf list discussion I referenced and, at that time, the precursor to Graph, AbstractStore had __eq__ defined thusly:

def __eq__(self, other):
        # Note: this is not a test of isomorphism, but rather exact
        # equality.
        if len(self)!=len(other):
            return 0
        for s, p, o in self:
            if not other.exists(s, p, o):
                return 0
        for s, p, o in other:
            if not self.exists(s, p, o):
                return 0
        return 1

This definition persisted through the next 5 years (the early RDFLib 1.x-2.x versions) during which Graph was introduced and which Graph inherited, up until the 2007 RDFLib 2.4.0 when the __eq__ operator was removed at the same time that isomorphic was added. It remained that way until 2011 when __eq__ was reintroduced.

Given that what was re-inroduced in that last commit appears to be effectively superfluous code, maybe that particular change to Graph will be reverted in 0.7.

FWIW, Gavin Carother's revamp of pymantic leaves __eq__ unimplemented for Graph.

tombaker commented 2 years ago

@gjhiggins Many thanks for the historical background!

Given that what was re-inroduced in that last commit appears to be effectively superfluous code, maybe that particular change to Graph will be reverted in 0.7.

What do you mean by '0.7' - is it a development branch in the works? (The latest release appears to be 6.3.0a0.) And are you saying that the revert would remove __eq__ entirely, leaving it unimplemented, as in pymantic? And you say "maybe" - is the revert under discussion somewhere?

FWIW, I like the AbstractStore.__eq__ listed above, though today's Graph does not have an exists method. I guess it could be implemented today with something like if not triple in other, though I'd be curious to know if there are any cases for which it would not work simply to compare with set(other) == set(self). I just tried both approaches and the set-to-set comparison ran twice as fast.

ghost commented 2 years ago

What do you mean by '0.7' - is it a development branch in the works? (The latest release appears to be 6.3.0a0.) And are you saying that the revert would remove __eq__ entirely, leaving it unimplemented, as in pymantic? And you say "maybe" - is the revert under discussion somewhere?

"0.7" was a typo, I meant "7.0" (sorry) the next release which is planned for October, the aspects of which are being variously discussed here, here and here and (relatedly) here. Yes, there's a maintenance argument for keeping the code lean and if those operators merely duplicate Python's default behaviour, then they're superfluous and so become candidates for removal.

As yet, there's no discussion, one may be prompted by my comment, otherwise it'll happen during the thrashing out of the Graph/Dataset architecture changes --- the current class/subclassing inheritance of Dataset(ConjunctiveGraph(Graph)) is beset by fundamental Liskov Substitution Principle violations and needs to be reworked, possibly as a set of Mixin classes (work is in progress).

This, plus a significant legacy of issues arising from the extant flawed implementation of Dataset and a decade-long-overdue migration to the use of identifiers as context references (as opposed to using Graph objects) is prompting a fundamental reworking of the Graph/Dataset architecture (partially discussed in detail here) - as appropriate to a major semver release. If you're interested in chipping in, the discussion will be aiming to culminate in an ADR (@aucampia is in the process of creating the first RDFLib ADR)

FWIW, I like the AbstractStore.__eq__ listed above, though today's Graph does not have an exists method. I guess it could be implemented today with something like if not triple in other, though I'd be curious to know if there are any cases for which it would not work simply to compare with set(other) == set(self). I just tried both approaches and the set-to-set comparison ran twice as fast.

Personally, I think identifier equality for Graph (i.e. Pythonic default) is preferable because it will be consonant with the not-implemented equality operator for the reworked Dataset (the venerable but now idiosyncratic ConjunctiveGraph is to be retired).

I suspect that it's good practice for a library to avoid making assumptions where they're not actually required and, as regards equality vs equivalence, respecting the contextual nature of the difference and (pace appropriate documentation) devolving the decision to the programmer isn't really asking a lot.

joernhees’s observation (on a related issue of isomorphism and datatypes) remains pertinent:

I think the reason for this can be seen (as many weird things) in rdflib trying to find the "pythonic" way in between syntactic and semantic equivalence.

  1. If you write Literal("foo").n3(), would you expect "foo" or "foo"^^xsd:string?
  2. If you now parse "foo"^^xsd:string, and afterwards serialize it again, what would you expect?
tombaker commented 2 years ago

@gjhiggens

It's great to hear about RDFLib 7 and that legacy choices will be revisited! Responding somewhat out of order...:

there's a maintenance argument for keeping the code lean and if those operators merely duplicate Python's default behaviour ... Personally, I think identifier equality for Graph (i.e. Pythonic default) is preferable ... I think the reason for this can be seen (as many weird things) in rdflib trying to find the "pythonic" way in between syntactic and semantic equivalence.

In my reading, the SO post cited above says that identity comparison is used only as a fallback, in cases where __eq__ has not been implemented. (Is this based on a PEP?) I can see why one might consider that fallback as the "default", but I'd hesitate to consider it to be the most "Pythonic" choice for RDFLib Graph given that the most basic Python container types do implement __eq__. Lists, tuples, and sets, for example, base equivalence not on identity, but on contents. As RDFLib Graph is explicitly characterized as set-like, then by the Principle of Least Astonishment, I would expect it to quack like a set, which means basing graph equivalence on set comparison. It was, in fact, my astonishment at the negative result of a comparison of two graphs with exactly the same triples that motivated me to create this issue.

I suspect that it's good practice for a library to avoid making assumptions where they're not actually required and, as regards equality vs equivalence, respecting the contextual nature of the difference and (pace appropriate documentation) devolving the decision to the programmer isn't really asking a lot.

For equivalence testing, it is unclear to me what sort of "contextual nature" should be considered. That I am aware, no other Python container type requires out-of-band context for proper interpretation. According to RDF 1.1 Concepts and Abstract Syntax, "RDF graphs are static snapshots of information". If so, then basic object equivalence should be based strictly on what is explicitly asserted in the graphs. Everything else -- eg, anything that depends on context or that might be inferred, materialized, canonicalized, normalized, or in any other way transformed or interpreted with respect to the snapshot at hand -- should simply be out of scope. This would of course not preclude more advanced graph equivalence algorithms based on other considerations, such as isomorphism.

I would like to track this issue in the preparation of RDFLib 7.0. If this specific issue is closed, I'd appreciate if a link could be posted here to further discussion. Whatever the outcome, a rationale for the decision should perhaps be recorded in the ADR and acknowledged in the documentation. I would be happy to comment on any drafts.