Certain nesting pattern causes oscillation in "sorted" Turtle state

ajnelson-nist commented 1 year ago

Hello,

I do some work with an open source community that has data maintenance processes that include using RDF Toolkit to normalize Turtle text.

Unfortunately, a few times we have encountered issues with nested blank nodes triggering some kind of incorrect sorting behavior. Most recently, this came up in a SHACL shape for a class that constrains values for two properties in a matching way. The constraint is that the properties' values need to be a class-member of neither class X nor class Y. Implementing this logic in SHACL ends up causing a 4-deep--5-deep reference path of blank nodes. (Alternative SHACL spellings are available using IRI-identified nodes to sidestep this issue with RDF Toolkit, but I do not think that would be an appropriate influence of a normalization tool on data management.)

Something about this structure---probably the blank node subgraphs being isomorphic except for one individual reference in the top blank node, and having an rdf:List deeper down---confuses RDF Toolkit's sorting, and causes a reliable oscillation between "sorted" states.

Here is a minimally reproducing example Turtle graph, excerpted from this file:

@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/shapes/bfo/BFO_0000004-shape>
    a sh:NodeShape ;
    rdfs:comment "independent continuant"@en ;
    sh:property
        [
            a sh:PropertyShape ;
            sh:not [
                a sh:NodeShape ;
                sh:or (
                    [
                        a sh:NodeShape ;
                        sh:class obo:BFO_0000020 ;
                    ]
                    [
                        a sh:NodeShape ;
                        sh:class obo:BFO_0000031 ;
                    ]
                ) ;
            ] ;
            sh:path obo:BFO_0000176 ;
        ] ,
        [
            a sh:PropertyShape ;
            sh:not [
                a sh:NodeShape ;
                sh:or (
                    [
                        a sh:NodeShape ;
                        sh:class obo:BFO_0000020 ;
                    ]
                    [
                        a sh:NodeShape ;
                        sh:class obo:BFO_0000031 ;
                    ]
                ) ;
            ] ;
            sh:path obo:BFO_0000186 ;
        ]
        ;
    sh:targetClass obo:BFO_0000004 ;
    .

The problem observed is that running the file in this state through RDF Toolkit, --source-format and --target-format both Turtle, causes the two lines with sh:path in them (values obo:BFO_0000176 and obo:BFO_0000186) to swap places. Running that state through RDF Toolkit returns the original input.

I encountered this issue today using Java 18, rdf-toolkit.jar v1.14.2. We've previously encountered this in Java Temurin v11, on the first version (or something close to there in time) of rdf-toolkit.jar that had dropped support for Java 8.

mereolog commented 1 year ago

@ajnelson-nist thanks for that, but I was not able to reconstruct your problem. Using the latest commit (with the default settings) on your excerpt I got this:

@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/shapes/bfo/BFO_0000004-shape>
    a sh:NodeShape ;
    rdfs:comment "independent continuant"@en ;
    sh:property
        _:blank11 ,
        _:blank12
        ;
    sh:targetClass obo:BFO_0000004 ;
    .

_:blank01
    rdf:first _:blank05 ;
    rdf:rest _:blank03 ;
    .

_:blank02
    rdf:first _:blank06 ;
    rdf:rest _:blank04 ;
    .

_:blank03
    rdf:first _:blank07 ;
    rdf:rest rdf:nil ;
    .

_:blank04
    rdf:first _:blank08 ;
    rdf:rest rdf:nil ;
    .

_:blank05
    a sh:NodeShape ;
    sh:class obo:BFO_0000020 ;
    .

_:blank06
    a sh:NodeShape ;
    sh:class obo:BFO_0000020 ;
    .

_:blank07
    a sh:NodeShape ;
    sh:class obo:BFO_0000031 ;
    .

_:blank08
    a sh:NodeShape ;
    sh:class obo:BFO_0000031 ;
    .

_:blank09
    a sh:NodeShape ;
    sh:or _:blank02 ;
    .

_:blank10
    a sh:NodeShape ;
    sh:or _:blank01 ;
    .

_:blank11
    a sh:PropertyShape ;
    sh:not _:blank09 ;
    sh:path obo:BFO_0000176 ;
    .

_:blank12
    a sh:PropertyShape ;
    sh:not _:blank10 ;
    sh:path obo:BFO_0000186 ;
    .

Then I serialised this outcome again, but I got exactly the same file.

ajnelson-nist commented 1 year ago

My apologies, I forgot to give my whole command. It's these lines of the accompanying Makefile:

.check-%.ttl: \
  %.ttl \
  $(RDF_TOOLKIT_JAR)
    java -jar $(RDF_TOOLKIT_JAR) \
      --inline-blank-nodes \
      --source $< \
      --source-format turtle \
      --target _$@ \
      --target-format turtle
    mv _$@ $@

The relevant detail I should have included: I am using the --inline-blank-nodes flag.

mereolog commented 1 year ago

@ajnelson-nist I have it now - it is odd, we will look into it shortly.

mereolog commented 1 year ago

This was a tough one, but I hope I got to the bottom of the issue you reported.

In a nutshell, it is a non-feature, which cannot be fixed without major changes in the serialisation algorithm or perhaps cannot be fixed simpliciter.

A longer explanation is that the serialiser sorts an RDF graph by sorting all its triples, and triples are sorted first with respect to their subjects, then with respect to their predicates, then with respect to their objects. So the serialiser needs to sort nodes first:

for IRIs it is a simple lexicographic order
(simplifying the details) for bnodes there are two cases: (i) bnodes that represent collections are sorted on the basis of their members (ii) other bnodes are either sorted on the basis of the triples where they occur as subjects or on the lexicographic order of their temporary internal identifiers.

It is 2(ii) that causes the odd behaviour you reported.

Now the full explanation is rather convoluted.

First take these bnodes: A. specified in lines 17-20:

[
a sh:NodeShape ;
sh:class obo:BFO_0000020 ;
]

B. specified in lines 34-37:

[
a sh:NodeShape ;
sh:class obo:BFO_0000020 ;
]

Suppose that you need to order/sort them - how can you do this? They are not collections, so you cannot use 2(i). What about 2(ii)? Since each of them occurs in two triples as subjects and each respective triple has the same predicates and objects, you cannot order/sort them on this basis. So, within the current assumptions we make, they can be sorted on the basis of their temporary identifiers assigned by RDF4J and it happens that, probably due to the RDF4J algorithm, the identifier for the bnode B always gets sorted before the identifier bnode A.

A similar sorting occurs for bnodes in lines 21-24 and 38-41.

Then the serialiser gets to sort bnodes: C. specified in lines 16-25:

sh:or (
            [
                a sh:NodeShape ;
                sh:class obo:BFO_0000020 ;
            ]
            [
                a sh:NodeShape ;
                sh:class obo:BFO_0000031 ;
            ]
        ) ;

D. specified in lines 33-42:

sh:or (
            [
                a sh:NodeShape ;
                sh:class obo:BFO_0000020 ;
            ]
            [
                a sh:NodeShape ;
                sh:class obo:BFO_0000031 ;
            ]
        ) ;

Both bnodes are lists, so the serialiser sorts them on the basis of their members. The first member of D, i.e., bnode B, is sorted before the first member of C, i.e., bnode A, therefore D gets sorted before C.

Etc.

Finally, the serialiser tries to sort the "top" bnodes: E. specified in lines 12-28:

[
    a sh:PropertyShape ;
    sh:not [
        a sh:NodeShape ;
        sh:or (
            [
                a sh:NodeShape ;
                sh:class obo:BFO_0000020 ;
            ]
            [
                a sh:NodeShape ;
                sh:class obo:BFO_0000031 ;
            ]
        ) ;
    ] ;
    sh:path obo:BFO_0000176 ;
]

F. specified in lines 29-45:

[
    a sh:PropertyShape ;
    sh:not [
        a sh:NodeShape ;
        sh:or (
            [
                a sh:NodeShape ;
                sh:class obo:BFO_0000020 ;
            ]
            [
                a sh:NodeShape ;
                sh:class obo:BFO_0000031 ;
            ]
        ) ;
    ] ;
    sh:path obo:BFO_0000186 ;
]

They are not lists, but they occur as subjects in three respective triples:

... a sh:PropertyShape
...
```
sh:not [
    a sh:NodeShape ;
    sh:or (
        [
            a sh:NodeShape ;
            sh:class obo:BFO_0000020 ;
        ]
        [
            a sh:NodeShape ;
            sh:class obo:BFO_0000031 ;
        ]
    ) ;
]
```
1. ... sh:path obo:BFO_0000176. The first and third one have the same predicates and object, so we can sort E and F only on the basis of the second ones. And because the F triple with the sh:notpredicate has the object which gets sorted before the object in the respective E triple, bnode F gets sorted before bnode E. As a result, this looks as if BFO_0000186 was swapped with BFO_0000176.

Then the second run of the serialiser performs a similar operation, but this time BFO_0000176 is swapped with BFO_0000186.

I suspect that the only way out is use a radically different sorting algorithm, e.g., the one discussed 10.1109/Blockchain55522.2022.00069, but even then we may encounter some odd artefacts.

fkleedorfer commented 2 weeks ago

Sorry to be necromancing, but this issue just got fixed in turtle-formatter which suffered from it as well. The approach was to listen to the parse process and build a list of blank nodes in the order they are encountered by the parser. This allows you to reproduce the ordering of blank nodes when needed. turtle-formatter is based on jena, so you cannot use the fix applied there, but I think all you have to do is add an RdfHandler to the TurtleParser, record blank nodes in handleStatement() and pass the list on to the formatter where you use it to order the blank nodes wherever they appear.

mereolog commented 2 weeks ago

@fkleedorfer, thanks for your insight.

We will look into this in due time, which may not happen soon :(.

edmcouncil / rdf-toolkit

Certain nesting pattern causes oscillation in "sorted" Turtle state #49