Swirrl / drafter

A clojure service and a client to it for exposing data management operations to PMD

Metacircular rewriting bug #607

Closed RickMoynihan closed 2 years ago

RickMoynihan commented 2 years ago

Bertrand Russell had his life's work destroyed by meta-circular paradoxes and issues like this, so we're in good company at least 😅

I'm being facetious though, as I don't think this is a true paradox. Let's start with a reproduction:

  1. Start an empty drafter (for simplicity)
  2. Create a draftset

We now have a state graph like this, no problems so far:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://publishmydata.com/graphs/drafter/drafts> {
    <http://publishmydata.com/def/drafter/draftset/c3d5f758-cefd-4005-9a36-ad76f3f7acd2> a <http://publishmydata.com/def/drafter/DraftSet> ;
      <http://purl.org/dc/terms/modified> "2022-04-07T15:10:52.931Z"^^xsd:dateTime ;
      <http://publishmydata.com/def/drafter/version> <http://publishmydata.com/def/drafter/version/e5363868-7dd6-40d5-93a1-d75797838a29> ;
      <http://purl.org/dc/terms/created> "2022-04-07T15:10:52.931Z"^^xsd:dateTime ;
      <http://purl.org/dc/terms/creator> <mailto:4Q0JWbTZMGbSiO7BmdJmatXGNmDIWwuu@clients> ;
      <http://publishmydata.com/def/drafter/hasOwner> <mailto:4Q0JWbTZMGbSiO7BmdJmatXGNmDIWwuu@clients> ;
      rdfs:label "bug" .
}
  3. Now load the problematic barbers.trig file, cut down from a real-world example from the COVID infection survey. Note that barber:record is both a graph and a resource/subject, so the data contains triples that speak about a graph, "the barber".
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pmdcat: <http://publishmydata.com/pmdcat#> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix barber: <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19/> .

barber:record {
    <http://gss-data.org.uk/catalog> dcat:record barber:record .
    barber:record foaf:primaryTopic <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19> .
    barber:record pmdcat:metadataGraph barber:record .
}

<http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19/structure> {
    <http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19> a qb:DataSet .
}
  4. We then load this in by putting the data as TriG into the appropriate draftset.

This now yields the following state graph. To my eyes this looks correct (but please double check and confirm):

@prefix : <http://api.stardog.com/> .
@prefix stardog: <tag:stardog:api:> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pmdcat: <http://publishmydata.com/pmdcat#> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix barber: <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19/> .

<http://publishmydata.com/graphs/drafter/drafts> {
   <http://publishmydata.com/def/drafter/draftset/c3d5f758-cefd-4005-9a36-ad76f3f7acd2> a <http://publishmydata.com/def/drafter/DraftSet> ;
      dcat:modified "2022-04-07T15:13:35.134Z"^^xsd:dateTime ;
      <http://publishmydata.com/def/drafter/version> <http://publishmydata.com/def/drafter/version/7cdbf9f4-1f1d-421d-923c-d794de64c653> ;
      dcat:created "2022-04-07T15:10:52.931Z"^^xsd:dateTime ;
      dcat:creator <mailto:4Q0JWbTZMGbSiO7BmdJmatXGNmDIWwuu@clients> ;
      <http://publishmydata.com/def/drafter/hasOwner> <mailto:4Q0JWbTZMGbSiO7BmdJmatXGNmDIWwuu@clients> ;
      rdfs:label "bug" .

  barber:record
        a <http://publishmydata.com/def/drafter/ManagedGraph> ;
        <http://publishmydata.com/def/drafter/isPublic> false ;
        <http://publishmydata.com/def/drafter/hasDraft> <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> .

  <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470>
        a <http://publishmydata.com/def/drafter/DraftGraph> ;
        dcat:created "2022-04-07T15:13:35.229Z"^^xsd:dateTime ;
      <http://publishmydata.com/def/drafter/inDraftSet> <http://publishmydata.com/def/drafter/draftset/c3d5f758-cefd-4005-9a36-ad76f3f7acd2> .

  <http://publishmydata.com/graphs/drafter/graph-modified-times>
        a <http://publishmydata.com/def/drafter/ManagedGraph> ;
        <http://publishmydata.com/def/drafter/isPublic> false ;
        <http://publishmydata.com/def/drafter/hasDraft> <http://publishmydata.com/graphs/drafter/draft/de2ff3c2-9e47-49b5-8e61-d2dbd986ed7b> .

  <http://publishmydata.com/graphs/drafter/draft/de2ff3c2-9e47-49b5-8e61-d2dbd986ed7b>
        a <http://publishmydata.com/def/drafter/DraftGraph> ;
        dcat:created "2022-04-07T15:13:35.519Z"^^xsd:dateTime ;
        <http://publishmydata.com/def/drafter/inDraftSet> <http://publishmydata.com/def/drafter/draftset/c3d5f758-cefd-4005-9a36-ad76f3f7acd2> .

  <http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19/structure>
        a <http://publishmydata.com/def/drafter/ManagedGraph> ;
        <http://publishmydata.com/def/drafter/isPublic> false ;
        <http://publishmydata.com/def/drafter/hasDraft> <http://publishmydata.com/graphs/drafter/draft/37487891-3413-41a5-9316-cc85f0bc977f> .

  <http://publishmydata.com/graphs/drafter/draft/37487891-3413-41a5-9316-cc85f0bc977f>
        a <http://publishmydata.com/def/drafter/DraftGraph> ;
        dcat:created "2022-04-07T15:13:35.742Z"^^xsd:dateTime ;
        <http://publishmydata.com/def/drafter/inDraftSet> <http://publishmydata.com/def/drafter/draftset/c3d5f758-cefd-4005-9a36-ad76f3f7acd2> .
}

<http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> {
  <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470>
        foaf:primaryTopic <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19> ;
        pmdcat:metadataGraph <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> .

  <http://gss-data.org.uk/catalog> dcat:record <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> .
}

<http://publishmydata.com/graphs/drafter/draft/de2ff3c2-9e47-49b5-8e61-d2dbd986ed7b> {
    <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> dcat:modified "2022-04-07T15:13:35.134Z"^^xsd:dateTime .
    <http://publishmydata.com/graphs/drafter/draft/de2ff3c2-9e47-49b5-8e61-d2dbd986ed7b> dcat:modified "2022-04-07T15:13:35.134Z"^^xsd:dateTime .
    <http://publishmydata.com/graphs/drafter/draft/37487891-3413-41a5-9316-cc85f0bc977f> dcat:modified "2022-04-07T15:13:35.134Z"^^xsd:dateTime .
}

<http://publishmydata.com/graphs/drafter/draft/37487891-3413-41a5-9316-cc85f0bc977f> {
        <http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19> a qb:DataSet .
}

The salient bits are:

  barber:record
        a <http://publishmydata.com/def/drafter/ManagedGraph> ;
        <http://publishmydata.com/def/drafter/isPublic> false ;
        <http://publishmydata.com/def/drafter/hasDraft> <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> .

which associates the draft graph with the barbers graph.

and this piece of rewritten data, which looks correct:

<http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> {
  <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470>
        foaf:primaryTopic <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19> ;
        pmdcat:metadataGraph <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> .

  <http://gss-data.org.uk/catalog> dcat:record <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> .
}
  5. Finally, publish the draft and witness the broken data in the public endpoint:
@prefix : <http://api.stardog.com/> .
@prefix stardog: <tag:stardog:api:> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pmdcat: <http://publishmydata.com/pmdcat#> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix barber: <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19/> .

<http://publishmydata.com/graphs/drafter/endpoints> {
    <http://publishmydata.com/def/drafter/public> a <http://publishmydata.com/def/drafter/Endpoint> ;
      dcat:modified "2022-04-07T15:22:14.272Z"^^xsd:dateTime ;
      dcat:issued "2022-04-07T14:03:55.328Z"^^xsd:dateTime ;
      <http://publishmydata.com/def/drafter/version> <http://publishmydata.com/def/drafter/version/57f1cd3b-7cfd-4822-a136-9086a23c3a3c> .
}

<http://publishmydata.com/graphs/drafter/drafts> {
    barber:record a <http://publishmydata.com/def/drafter/ManagedGraph> ;
      dcat:issued "2022-04-07T15:22:12.934Z"^^xsd:dateTime ;
      <http://publishmydata.com/def/drafter/isPublic> true .
    <http://publishmydata.com/graphs/drafter/graph-modified-times> a <http://publishmydata.com/def/drafter/ManagedGraph> ;
      <http://publishmydata.com/def/drafter/isPublic> true .
    <http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19/structure> a <http://publishmydata.com/def/drafter/ManagedGraph> ;
      dcat:issued "2022-04-07T15:22:12.934Z"^^xsd:dateTime ;
      <http://publishmydata.com/def/drafter/isPublic> true .
}

barber:record {
    barber:record <http://xmlns.com/foaf/0.1/primaryTopic> <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19> ;
      pmdcat:metadataGraph <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> .

    <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> pmdcat:metadataGraph barber:record .
    <http://gss-data.org.uk/catalog> <http://www.w3.org/ns/dcat#record> barber:record .
}

<http://publishmydata.com/graphs/drafter/graph-modified-times> {
    barber:record dcat:modified "2022-04-07T15:13:35.134Z"^^xsd:dateTime .
    <http://publishmydata.com/graphs/drafter/graph-modified-times> dcat:modified "2022-04-07T15:13:35.134Z"^^xsd:dateTime .
    <http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19/structure> dcat:modified "2022-04-07T15:13:35.134Z"^^xsd:dateTime .
}

<http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19/structure> {
    <http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19> a <http://purl.org/linked-data/cube#DataSet> .
}

In particular the broken triples are:

barber:record pmdcat:metadataGraph <http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> .

which should be barber:record pmdcat:metadataGraph barber:record .

and

<http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> pmdcat:metadataGraph barber:record .

which should also be barber:record pmdcat:metadataGraph barber:record . In other words, we have two triples where we should have one.

Early thoughts

First thing to note is the problem appears to occur at publication (not during the rewriting during the append).

Secondly, I wonder if this is an issue with reflexive triples that sit in a graph of the same name, i.e. quads of the following form:

the:barber ?any-pred the:barber the:barber
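
One way to test this hypothesis against input data is to scan for quads whose graph name also appears in more than one position of the triple. The following is a minimal sketch in Python, not part of drafter: quads are modelled as plain tuples of IRI strings, and the helper names `graph_mentions` and `multi_position_quads` are hypothetical. The sample data is the barber:record graph from the reproduction.

```python
# Quads are modelled as (subject, predicate, object, graph) tuples of IRI strings.
BARBER = "http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19/record"
DATASET = "http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19"

quads = [
    ("http://gss-data.org.uk/catalog", "http://www.w3.org/ns/dcat#record", BARBER, BARBER),
    (BARBER, "http://xmlns.com/foaf/0.1/primaryTopic", DATASET, BARBER),
    (BARBER, "http://publishmydata.com/pmdcat#metadataGraph", BARBER, BARBER),
]


def graph_mentions(quad):
    """Count how many triple positions hold the quad's own graph name."""
    s, p, o, g = quad
    return sum(term == g for term in (s, p, o))


def multi_position_quads(quads):
    """Return quads whose graph name occurs in more than one triple position."""
    return [q for q in quads if graph_mentions(q) > 1]


suspects = multi_position_quads(quads)
for s, p, o, g in suspects:
    print(s, p, o, g)
```

Only the pmdcat:metadataGraph statement is flagged here; the other two mention the graph in a single position.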

RickMoynihan commented 2 years ago

drafter-barber-bug.zip

@lkitching I've assembled the data reproducing the inputs and the various states in the above zip.

The important files are:

The remaining files are:

I would have run the publish-attempted-fix.sparql update myself but ran out of time today, and was having problems with my Stardog environment running the update. Hopefully I can pass the baton on to you and we can get it over the line 🙇

lkitching commented 2 years ago

This appears to be caused by an issue in both of the queries that rewrite between draft and live graphs. When inserting data into a draft, the batch is appended into the draft graph and the entire draftset is then re-written by a query built by draft-management/rewrite-draftset-q.

For a draftset http://draft-set rewrite-draftset-q returns the following query:

DELETE { GRAPH ?g { ?lg ?p1 ?o1 . ?s2 ?lg ?o2 . ?s3 ?p3 ?lg . } }
INSERT { GRAPH ?g { ?dg ?p1 ?o1 . ?s2 ?dg ?o2 . ?s3 ?p3 ?dg . } }
WHERE {
  GRAPH ?g { { ?lg ?p1 ?o1 } UNION
             { ?s2 ?lg ?o2 } UNION
             { ?s3 ?p3 ?lg } }
  GRAPH <http://publishmydata.com/graphs/drafter/drafts> {
    ?ds <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://publishmydata.com/def/drafter/DraftSet> .
    ?g <http://publishmydata.com/def/drafter/inDraftSet> ?ds .
    ?dg <http://publishmydata.com/def/drafter/inDraftSet> ?ds .
    ?lg <http://publishmydata.com/def/drafter/hasDraft> ?dg .
    FILTER EXISTS { GRAPH ?dg { ?s_ ?p_ ?o_ } }
    VALUES ?ds { <http://draft-set> }
  }
}

This query attempts to re-write all live graph URIs found in the data within all (draft) graphs in the draftset to their corresponding draft graph URIs. Each of the UNION clauses matches the case where the live graph is in the subject, predicate or object position.

single-reference.trig

barber:record {
    barber:record pmdcat:metadataGraph "Example" .
}

The above example only references the graph in the subject position, and after inserting into a new draftset, the draft graph contains the expected re-written statement:

<http://publishmydata.com/graphs/drafter/draft/e0165ced-7193-4897-9e28-88249172e796> {
    <http://publishmydata.com/graphs/drafter/draft/e0165ced-7193-4897-9e28-88249172e796> pmdcat:metadataGraph "Example" .
}

This query does not work as expected for statements where the live graph matches multiple components within a statement.

barber-minimal.trig

@prefix pmdcat: <http://publishmydata.com/pmdcat#> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix barber: <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19/> .

barber:record {
    barber:record pmdcat:metadataGraph barber:record .
}

The process for inserting this data into an empty draftset http://publishmydata.com/def/drafter/draftset/268e530c-c7d2-4a64-af7b-15822f771e7f is as follows:

  1. Create a new draft graph for the live graph barber:record
  2. Insert the statement barber:record pmdcat:metadataGraph barber:record into the draft graph http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88
  3. Rewrite the entire draftset with the above query

After step 2 above (just before the draftset is rewritten) barber:record has a draft http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88 containing the following:

<http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> {
    barber:record pmdcat:metadataGraph barber:record
}

Running the rewriting query then results in the following:

<http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> {
    <http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> pmdcat:metadataGraph barber:record .
    barber:record pmdcat:metadataGraph <http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> .
}

This is because the statement barber:record pmdcat:metadataGraph barber:record matches both the { ?lg ?p1 ?o1 } and the { ?s3 ?p3 ?lg } UNION clauses, which leads to two deletes of the source triple and two different inserts into the draft graph.

Note that if you run the draft rewrite query a second time the resulting draft graph will be as intended with a single re-written triple:

<http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> {
  <http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> pmdcat:metadataGraph <http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> .
}

This is because the two triples are matched by two different UNION clauses and re-written into the same triple.
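
The two passes described above can be reproduced with a small Python model of the update. This is a sketch under simplifying assumptions rather than drafter's code: a graph is a set of string triples, a single live/draft pair is considered, `rewrite_once` is a hypothetical helper, and `draft:93ce47b9` is a shortened stand-in for the full draft graph URI. The model follows SPARQL Update behaviour in that each UNION branch binds only its own variables, so each matching branch contributes one delete and one insert in which only that branch's position is replaced.

```python
LIVE = "barber:record"
DRAFT = "draft:93ce47b9"  # shortened stand-in for the draft graph URI
PRED = "pmdcat:metadataGraph"


def rewrite_once(triples, src, dst):
    """Model one DELETE/INSERT pass of the UNION-based query over one graph."""
    deletes, inserts = set(), set()
    for s, p, o in triples:
        if s == src:  # branch { ?lg ?p1 ?o1 }
            deletes.add((s, p, o))
            inserts.add((dst, p, o))
        if p == src:  # branch { ?s2 ?lg ?o2 }
            deletes.add((s, p, o))
            inserts.add((s, dst, o))
        if o == src:  # branch { ?s3 ?p3 ?lg }
            deletes.add((s, p, o))
            inserts.add((s, p, dst))
    # All deletes are applied before any inserts.
    return (set(triples) - deletes) | inserts


draft_graph = {(LIVE, PRED, LIVE)}
first_pass = rewrite_once(draft_graph, LIVE, DRAFT)
second_pass = rewrite_once(first_pass, LIVE, DRAFT)

print(sorted(first_pass))   # two partially rewritten triples
print(sorted(second_pass))  # one fully rewritten triple
```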

This is why the rewriting issue is not apparent when inserting barbers.trig into a draftset - appended files are split into batches grouped by source graph. The barber:record graph triples are inserted first and re-written, but the entire draft is then re-written again after inserting the batch for the <http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19/structure> graph. By reducing the example to a single graph, all the data is inserted within a single batch and the draftset is only re-written once.
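
The masking effect of batching can be sketched in Python as follows. This is an illustrative model, not drafter's implementation: `append_in_batches` and `rewrite_once` are hypothetical helpers, graph and resource names are shortened stand-ins, the graph-modified-times graph is ignored, and the batch order (barber:record first) is taken from the description above.

```python
def rewrite_once(triples, live_to_draft):
    """One DELETE/INSERT pass of the UNION-based rewrite over a single graph."""
    deletes, inserts = set(), set()
    for s, p, o in triples:
        for live, draft in live_to_draft.items():
            if s == live:
                deletes.add((s, p, o))
                inserts.add((draft, p, o))
            if p == live:
                deletes.add((s, p, o))
                inserts.add((s, draft, o))
            if o == live:
                deletes.add((s, p, o))
                inserts.add((s, p, draft))
    return (set(triples) - deletes) | inserts


def append_in_batches(batches):
    """Append (live, draft, triples) batches, rewriting the whole draftset after each."""
    drafts, live_to_draft = {}, {}
    for live, draft, triples in batches:
        live_to_draft[live] = draft
        drafts.setdefault(draft, set()).update(triples)
        drafts = {g: rewrite_once(ts, live_to_draft) for g, ts in drafts.items()}
    return drafts


LIVE_A, DRAFT_A = "barber:record", "draft:barber"
LIVE_B, DRAFT_B = "structure", "draft:structure"
MG = "pmdcat:metadataGraph"

barber_triples = {
    ("catalog", "dcat:record", LIVE_A),
    (LIVE_A, "foaf:primaryTopic", "dataset"),
    (LIVE_A, MG, LIVE_A),
}
structure_triples = {("cube", "rdf:type", "qb:DataSet")}

# barbers.trig: two source graphs, so the draftset is rewritten twice.
two_batches = append_in_batches([
    (LIVE_A, DRAFT_A, barber_triples),
    (LIVE_B, DRAFT_B, structure_triples),
])
# barber-minimal.trig: one source graph, so the draftset is rewritten once.
one_batch = append_in_batches([(LIVE_A, DRAFT_A, {(LIVE_A, MG, LIVE_A)})])

print(sorted(two_batches[DRAFT_A]))
print(sorted(one_batch[DRAFT_A]))
```

In this model the second rewrite, triggered by the unrelated structure batch, is what repairs the reflexive statement in the barber draft; with a single batch the two partially rewritten triples remain.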

The reverse re-writing is carried out across all draft graphs within a draftset as part of the publication process. This re-writing is performed by a query returned by draft-management/unrewrite-draftset-q which requires the collection of draft graphs within the draftset.

For a draftset containing the draft graphs http://draft1 and http://draft2, unrewrite-draftset-q returns the following query:

DELETE { GRAPH ?g { ?dg ?p1 ?o1 . ?s2 ?dg ?o2 . ?s3 ?p3 ?dg . } }
INSERT { GRAPH ?g { ?lg ?p1 ?o1 . ?s2 ?lg ?o2 . ?s3 ?p3 ?lg . } }
WHERE {
  GRAPH ?g { { ?dg ?p1 ?o1 } UNION
             { ?s2 ?dg ?o2 } UNION
             { ?s3 ?p3 ?dg } }
  GRAPH <http://publishmydata.com/graphs/drafter/drafts> {
    ?ds <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://publishmydata.com/def/drafter/DraftSet> .
    ?g <http://publishmydata.com/def/drafter/inDraftSet> ?ds .
    ?dg <http://publishmydata.com/def/drafter/inDraftSet> ?ds .
    ?lg <http://publishmydata.com/def/drafter/hasDraft> ?dg .
    VALUES ?dg { <http://draft1> <http://draft2> }
  }
}

This query suffers from the same issue as the live-to-draft update: statements that contain the draft graph in more than one component are matched by multiple UNION clauses, and this results in multiple partially-rewritten statements being inserted into the source graph.
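
To illustrate the draft-to-live direction, the Python sketch below applies a model of the UNION-based update to the barber draft graph from the original report, then contrasts it with one possible alternative that substitutes every position of a statement in a single step. Both helpers (`rewrite_union`, `rewrite_per_term`) are hypothetical, the names are shortened stand-ins, and the per-term variant is only an assumption about what a fix could look like, not the change made in drafter.

```python
def rewrite_union(triples, mapping):
    """Model of the UNION-based update: each matching position is rewritten separately."""
    deletes, inserts = set(), set()
    for s, p, o in triples:
        for src, dst in mapping.items():
            if s == src:
                deletes.add((s, p, o))
                inserts.add((dst, p, o))
            if p == src:
                deletes.add((s, p, o))
                inserts.add((s, dst, o))
            if o == src:
                deletes.add((s, p, o))
                inserts.add((s, p, dst))
    return (set(triples) - deletes) | inserts


def rewrite_per_term(triples, mapping):
    """Hypothetical alternative: replace all mapped terms of a statement at once."""
    return {tuple(mapping.get(term, term) for term in triple) for triple in triples}


LIVE, DRAFT = "barber:record", "draft:f4ae3e50"
MG = "pmdcat:metadataGraph"

draft_graph = {
    (DRAFT, "foaf:primaryTopic", "dataset"),
    (DRAFT, MG, DRAFT),
    ("catalog", "dcat:record", DRAFT),
}
draft_to_live = {DRAFT: LIVE}

published_union = rewrite_union(draft_graph, draft_to_live)
published_per_term = rewrite_per_term(draft_graph, draft_to_live)

print(sorted(published_union))     # contains two half-rewritten metadataGraph triples
print(sorted(published_per_term))  # contains a single barber:record metadataGraph triple
```

The UNION model reproduces the two broken triples seen in the public endpoint, while the per-term substitution yields the single expected statement.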