N3 formula/rule persistence not idempotent

tduval-unifylogic commented 2 years ago

I was just taking the N3 persistence fix for a spin and noticed that if I load the same QuotedGraph(store, identifier='') multiple times, the formula is replicated. Is this by design?

Ran this three times successively and you can see the formula is now part of the QuotedGraph three times:

mwatts15 commented 2 years ago

thanks for the report! i wouldn't say it's by design, but duplicate removal isn't entirely consistent. will have a look later.

ghost commented 2 years ago

It's not quite an idempotency issue because the parser creates its own internal references for bnodes and formula graphs which differ with each parsing.

The clue is in the repeated _:foo a rdfs:Class statements. The RDFLib Notation3 parser allocates a different BNode identifier (to the bnode represented by _:foo in the input graph) each time the graph is parsed, and so syntax-based idempotency fails:

(BNode('fcbfd5c4...'), RDF.type, RDFS.Class) != (BNode('fb838c78...'), RDF.type, RDFS.Class)

The same thing happens to the reified statements. With each parsing, the RDFLib Notation3 parser allocates a different (Formula<n>) identifier to each of the quoted graphs (which are unnamed in the input graph). The difference in Graph identifiers means that:

(
    <Graph identifier=Formula2 (<class 'rdflib.graph.QuotedGraph'>)>,
    rdflib.term.URIRef('http://www.w3.org/2000/10/swap/log#implies'),
    <Graph identifier=Formula3 (<class 'rdflib.graph.QuotedGraph'>)>
)

is not syntactically identical to:

(
    <Graph identifier=Formula5 (<class 'rdflib.graph.QuotedGraph'>)>,
    rdflib.term.URIRef('http://www.w3.org/2000/10/swap/log#implies'),
    <Graph identifier=Formula6 (<class 'rdflib.graph.QuotedGraph'>)>
)

which is why the graph contents increase with each parsing iteration.

The same applies to any serialized graph that contains bnodes, e.g. N-Triples:

from rdflib import Graph


def test_repeated_parse_of_graph_with_bnode():
    data = "<http://example/s> <http://example/p> _:foo ."
    g = Graph()
    g.parse(data=data, format="nt")
    assert len(list(g)) == 1
    # [
    #   (URIRef('http://example/s'), URIRef('http://example/p'), BNode('N52651bf32e9f4383abbe32ed16bcee54'))
    # ]
    g.parse(data=data, format="nt")
    assert len(list(g)) == 2
    # [
    #   (URIRef('http://example/s'), URIRef('http://example/p'), BNode('Nc14bee66a8e6407aa89fdd1ce9c51315')),
    #   (URIRef('http://example/s'), URIRef('http://example/p'), BNode('N52651bf32e9f4383abbe32ed16bcee54'))
    # ]

And, if you'll forgive the liberty, your use of QuotedGraph isn't quite as intended. As it happens, I'm having to deal with internal representations as part of reworking the RDFLib Dataset and as part of that, I'm collating some workings-out as documentation. Here's my workings-out of this issue that allowed me to build the requisite understanding:

from rdflib import BNode, Graph, URIRef, Variable
from rdflib.graph import QuotedGraph


def test_n3_rule():

    test_n3 = """@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix : <http://test/> .
    {:a :b :c;a :foo} => {:a :d :c,?y}.
    _:foo a rdfs:Class.
    :a :d :c."""

    implies = URIRef("http://www.w3.org/2000/10/swap/log#implies")

    expected = [
        """@prefix : <http://test/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

[] a rdfs:Class .

:a :d :c .

{
    :a a :foo ;
        :b :c .

} => {
        :a :d ?y,
                :c .

    } .

""",
    """@prefix : <http://test/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

[] a rdfs:Class .

{
    :a a :foo ;
        :b :c .

} => {
        :a :d ?y,
                :c .

    } .

:a :d :c .

""",
    ]

    g = Graph()

    g.parse(data=test_n3, format="n3")

    gser = g.serialize(format="n3")

    assert gser in expected

    # Programmatically reconstruct the input graph

    a = URIRef("http://test/a")
    b = URIRef("http://test/b")
    c = URIRef("http://test/c")
    d = URIRef("http://test/d")

    formulaA = QuotedGraph("Memory", BNode("FormulaA"))

    # {
    #     :a a :foo ;
    #         :b :c .
    # }

    formulaA.add(
        (
            a,
            URIRef("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
            URIRef("http://test/foo"),
        )
    )
    formulaA.add((a, b, c))

    formulaB = QuotedGraph("Memory", BNode("FormulaB"))

    # {
    #     :a :d ?y,
    #             :c .

    # }

    formulaB.add((a, d, Variable("y")))
    formulaB.add((a, d, c))

    g2 = Graph()
    g2.bind("", URIRef("http://test/"))

    # :a :d :c .
    g2.add((a, d, c))

    # [] a rdfs:Class .
    g2.add(
        (
            BNode("foo"),
            URIRef("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
            URIRef("http://www.w3.org/2000/01/rdf-schema#Class"),
        )
    )

    # } => {

    g2.add((formulaA, implies, formulaB))

    g2ser = g2.serialize(format="n3")

    # Voilà
    assert g2ser in expected

The re-worked Dataset implementation shows the parsed structure a little more clearly when serialized as TriG:

from rdflib import Dataset, URIRef


def test_n3_ds():

    test_n3 = """@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix : <http://test/> .
    {:a :b :c;a :foo} => {:a :d :c,?y}.
    _:foo a rdfs:Class.
    :a :d :c."""

    ds = Dataset()
    ds.bind("log", URIRef("http://www.w3.org/2000/10/swap/log#"))

    # Create named subgraph as input destination
    g = ds.graph(URIRef("http://test/example"))

    g.parse(data=test_n3, format="n3")

    # Print the TriG serialization (output shown below)
    print(ds.serialize(format='trig'))

@prefix : <http://test/> .
@prefix log: <http://www.w3.org/2000/10/swap/log#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

_:Formula2 {
    :a a :foo ;
        :b :c .
}

_:Formula3 {
    :a :d ?y,
            :c .
}

:example {
    [] a rdfs:Class .

    :a :d :c .

    {_:Formula2} log:implies {_:Formula3} .
}

mwatts15 commented 2 years ago

Thanks, @gjhiggins. If I understand correctly, you're saying this is an rdflib issue, rather than an issue with rdflib-sqlalchemy. Is that right?

ghost commented 2 years ago

If I understand correctly, you're saying this is an rdflib issue, rather than an issue with rdflib-sqlalchemy. Is that right?

Correct, not an issue with rdflib-sqlalchemy, not even an issue with RDFLib. It's just the way it is with blank node identifiers and, for RDFLib, what the OP reports is the expected and correct behaviour.

https://www.w3.org/TR/rdf11-concepts/#section-blank-nodes

Implementations that handle blank node identifiers in concrete syntaxes need to be careful not to create the same blank node from multiple occurrences of the same blank node identifier

Repeated parsing of the same graph effectively results in multiple occurrences of the same blank node identifier.

tduval-unifylogic commented 2 years ago

@gjhiggins

Thanks so much for the thoughtful response here. It makes sense. This is why I asked whether it was "by design" (BNode reification, etc.).

The examples you created above are excellent guidance on what to do vs. not do! I've recently become an N3 Logic evangelist, so I hope what you provided is helpful not just for me but also for others, and promotes wider adoption of it.

On the QuotedGraph topic, I appreciate the feedback/input. I was employing it based on the guidance here in the RDFLib documentation on QuotedGraph.

Is there a way to reflect the guidance you provided here in the Documentation?

ghost commented 2 years ago

Thanks so much for the thoughtful response here. It makes sense.

Glad to be of help.

On the QuotedGraph topic, I appreciate the feedback/input. I was employing it based on the guidance here in the RDFLib documentation on QuotedGraph.

Yes, that is a little on the sparse side, to say the least.

Is there a way to reflect the guidance you provided here in the Documentation?

It will be integrated. There's a 7.0 release of RDFLib in the planning, and I'll be helping to revamp the RDFLib narrative documentation (the API documentation will hugely benefit from @aucampia's sterling work on adding type hinting). There's a fair amount of existing documentation in various places which can be usefully integrated, such as (pertinent to your interests, I suspect) Chimezie's "BNode Drama for your Mama" blog post, languishing in the old “rdfextras” docs.

tduval-unifylogic commented 2 years ago

Thanks! Awesome blog! Closing...

RDFLib / rdflib-sqlalchemy

N3 formula/rule persistence not idempotent #96