Closed RickMoynihan closed 2 years ago
@lkitching I've assembled the data reproducing the inputs and the various states in the above zip.
The important files are:
barber.trig is the test data that causes the issue.
drafts-1.trig is the pre-publication state (with barber.trig loaded). This state looks correct to me.
publish.sparql is the generated SPARQL query that yields drafts-2.trig (the broken end state).
The remaining files are:
drafts.trig is just the empty state with a single draftset in it.
publish-attempted-fix.sparql is a speculative fix; it is currently unconfirmed to be correct or to work, but I think it'll be in this ballpark.
I would have run the publish-attempted-fix.sparql update myself but ran out of time today, and was having problems with my Stardog environment running the update. Hopefully I can pass the baton on to you and we can get it over the line 🙇
This appears to be caused by an issue in both of the queries that rewrite graph URIs between draft and live. When inserting data into a draft, the batch is appended into the draft graph and the entire draftset is then re-written by a query built by draft-management/rewrite-draftset-q.
For a draftset http://draft-set, rewrite-draftset-q returns the following query:
DELETE { GRAPH ?g { ?lg ?p1 ?o1 . ?s2 ?lg ?o2 . ?s3 ?p3 ?lg . } }
INSERT { GRAPH ?g { ?dg ?p1 ?o1 . ?s2 ?dg ?o2 . ?s3 ?p3 ?dg . } }
WHERE {
GRAPH ?g { { ?lg ?p1 ?o1 } UNION
{ ?s2 ?lg ?o2 } UNION
{ ?s3 ?p3 ?lg } }
GRAPH <http://publishmydata.com/graphs/drafter/drafts> {
?ds <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://publishmydata.com/def/drafter/DraftSet> .
?g <http://publishmydata.com/def/drafter/inDraftSet> ?ds .
?dg <http://publishmydata.com/def/drafter/inDraftSet> ?ds .
?lg <http://publishmydata.com/def/drafter/hasDraft> ?dg .
FILTER EXISTS { GRAPH ?dg { ?s_ ?p_ ?o_ } }
VALUES ?ds { <http://draft-set> }
}
}
This query attempts to re-write all live graph URIs found in the data within all (draft) graphs in the draftset to their corresponding draft graph URIs. Each of the UNION clauses matches the case where the live graph is in the subject, predicate or object position respectively.
single-reference.trig
barber:record {
barber:record pmdcat:metadataGraph "Example" .
}
The above example only references the graph in the subject position, and after inserting into a new draftset, the draft graph contains the expected re-written statement:
<http://publishmydata.com/graphs/drafter/draft/e0165ced-7193-4897-9e28-88249172e796> {
<http://publishmydata.com/graphs/drafter/draft/e0165ced-7193-4897-9e28-88249172e796> pmdcat:metadataGraph "Example" .
}
This query does not work as expected for statements where the live graph matches multiple components within a statement.
barber-minimal.trig
@prefix pmdcat: <http://publishmydata.com/pmdcat#> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix barber: <http://gss-data.org.uk/dataset/subregional-infection-rate-of-covid-19/> .
barber:record {
barber:record pmdcat:metadataGraph barber:record .
}
The process for inserting this data into an empty draftset http://publishmydata.com/def/drafter/draftset/268e530c-c7d2-4a64-af7b-15822f771e7f is as follows:
1. Create a draft graph for barber:record
2. Insert barber:record pmdcat:metadataGraph barber:record into the draft graph http://publishmydata.com/graphs/drafter/draft/09149c37-bb7b-49df-bdf6-1873269ca050
3. Rewrite the draftset
After step 2 above (just before the draftset is rewritten) barber:record has a draft http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88 containing the following:
<http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> {
barber:record pmdcat:metadataGraph barber:record
}
Running the rewriting query will result in the following:
<http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> {
<http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> pmdcat:metadataGraph barber:record .
barber:record pmdcat:metadataGraph <http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> .
}
This is because the statement barber:record pmdcat:metadataGraph barber:record matches both the { ?lg ?p1 ?o1 } and the { ?s3 ?p3 ?lg } UNION clauses, which leads to two deletes of the source triple and two different inserts into the draft graph.
Note that if you run the draft rewrite query a second time the resulting draft graph will be as intended with a single re-written triple:
<http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> {
<http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> pmdcat:metadataGraph <http://publishmydata.com/graphs/drafter/draft/93ce47b9-30de-4ecb-9088-0d0722618f88> .
}
This is because the two triples are matched by two different UNION clauses and re-written into the same triple.
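To make the mechanics concrete, here is a minimal Python sketch of the behaviour described above. It is a model, not drafter code: `rewrite_pass` is a hypothetical helper that mimics how a SPARQL Update evaluates the WHERE clause once against the pre-update state and then applies all deletes followed by all inserts. The shortened name `draft:93ce47b9` is an assumption standing in for the full draft graph URI.

```python
LIVE = "barber:record"
DRAFT = "draft:93ce47b9"  # assumed shorthand for the full draft graph URI
P = "pmdcat:metadataGraph"


def rewrite_pass(triples, live, draft):
    """Model one run of the UNION-based DELETE/INSERT over one draft graph."""
    deletes, inserts = set(), set()
    # Each UNION branch produces its own solutions and binds only its own
    # variables, so each branch rewrites exactly one position of the triple.
    for s, p, o in triples:
        if s == live:  # branch { ?lg ?p1 ?o1 }
            deletes.add((s, p, o))
            inserts.add((draft, p, o))
        if p == live:  # branch { ?s2 ?lg ?o2 }
            deletes.add((s, p, o))
            inserts.add((s, draft, o))
        if o == live:  # branch { ?s3 ?p3 ?lg }
            deletes.add((s, p, o))
            inserts.add((s, p, draft))
    # All deletes are applied before all inserts.
    return (set(triples) - deletes) | inserts


start = {(LIVE, P, LIVE)}
once = rewrite_pass(start, LIVE, DRAFT)   # two partially rewritten triples
twice = rewrite_pass(once, LIVE, DRAFT)   # both collapse into one triple
```

In this model `once` holds the two half-rewritten triples and `twice` holds the single fully rewritten triple, which matches the two draft graph states shown above.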
This is why the rewriting issue is not apparent when inserting barbers.trig into a draftset: appended files are split into batches grouped by source graph. The barber:record graph triples are inserted first and re-written, but the entire draft is then re-written again after inserting the batch for the <http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19/structure> graph. By reducing the example to a single graph, all the data is inserted within a single batch and the draftset is only re-written once.
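A rough Python sketch of the batching effect, assuming one rewrite pass runs after each batch as described above. `batches_by_graph` is a hypothetical helper, not drafter's implementation (which may also split batches in other ways), and the `ex:` triple is a placeholder because the real contents of the structure graph are not reproduced here.

```python
STRUCTURE = "http://gss-data.org.uk/datacube/subregional-infection-rate-of-covid-19/structure"


def batches_by_graph(quads):
    """Group (graph, s, p, o) quads into one batch per source graph,
    keeping graphs in the order they are first seen."""
    batches = {}
    for g, s, p, o in quads:
        batches.setdefault(g, []).append((s, p, o))
    return list(batches.items())


quads = [
    ("barber:record", "barber:record", "pmdcat:metadataGraph", "barber:record"),
    (STRUCTURE, "ex:s", "ex:p", "ex:o"),  # placeholder triple (assumption)
]

full_file = batches_by_graph(quads)     # two graphs: two batches, two rewrites
minimal = batches_by_graph(quads[:1])   # one graph: one batch, one rewrite
```

With two batches the draftset is rewritten twice, so the second pass hides the partial rewrite left by the first; with a single batch there is no second pass.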
The reverse re-writing is carried out across all draft graphs within a draftset as part of the publication process. This re-writing is performed by a query returned by draft-management/unrewrite-draftset-q, which requires the collection of draft graphs within the draftset.
For a draftset containing the draft graphs http://draft1 and http://draft2, unrewrite-draftset-q returns the following query:
DELETE { GRAPH ?g { ?dg ?p1 ?o1 . ?s2 ?dg ?o2 . ?s3 ?p3 ?dg . } }
INSERT { GRAPH ?g { ?lg ?p1 ?o1 . ?s2 ?lg ?o2 . ?s3 ?p3 ?lg . } }
WHERE {
GRAPH ?g { { ?dg ?p1 ?o1 } UNION
{ ?s2 ?dg ?o2 } UNION
{ ?s3 ?p3 ?dg } }
GRAPH <http://publishmydata.com/graphs/drafter/drafts> {
?ds <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://publishmydata.com/def/drafter/DraftSet> .
?g <http://publishmydata.com/def/drafter/inDraftSet> ?ds .
?dg <http://publishmydata.com/def/drafter/inDraftSet> ?ds .
?lg <http://publishmydata.com/def/drafter/hasDraft> ?dg .
VALUES ?dg { <http://draft1> <http://draft2> }
}
}
This query suffers from the same issue as the live-to-draft update, i.e. statements which contain the draft graph in more than one component are matched by multiple UNION clauses, resulting in multiple partially-rewritten statements being inserted into the source graph.
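For comparison, the intended outcome can be modelled in Python by substituting every position of a statement in one step instead of one position per UNION branch. This is only a sketch of the expected result: `unrewrite_statement` and `unrewrite_graph` are hypothetical names, it says nothing about what publish-attempted-fix.sparql contains, and the mapping of http://draft1 to barber:record is an assumption for illustration.

```python
P = "pmdcat:metadataGraph"


def unrewrite_statement(triple, draft_to_live):
    """Replace every draft graph URI in the triple with its live graph URI."""
    return tuple(draft_to_live.get(term, term) for term in triple)


def unrewrite_graph(triples, draft_to_live):
    """Apply the whole-statement substitution to every triple in a graph."""
    return {unrewrite_statement(t, draft_to_live) for t in triples}


draft_to_live = {"http://draft1": "barber:record"}  # assumed mapping
draft_graph = {
    ("http://draft1", P, "http://draft1"),
    ("ex:a", "ex:b", "ex:c"),  # placeholder triple with no graph references
}
published = unrewrite_graph(draft_graph, draft_to_live)
```

Because each statement is rewritten exactly once with all of its positions considered together, a triple that mentions the draft graph twice yields one live triple, not two partial ones.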
Bertrand Russell had his life's work destroyed by meta-circular paradoxes and issues like this, so we're in good company at least 😅
I'm being facetious though, as I don't think this is a true paradox; let's start with a reproduction.
We now have a state graph like this, no problems so far:
barbers.trig file. Cut down from a real world example from the covid infection survey. Note barber:record is both a graph and a resource/subject, and we therefore have triples in the data speaking about a graph, "the barber".
This now yields the following state graph. To my eyes this looks correct (but please double check and confirm):
The salient bits are:
which associates the draft graph with the barbers graph.
and this piece of rewritten data, which looks correct:
In particular the broken triples are:
which should be
barber:record pmdcat:metadataGraph barber:record .
and
<http://publishmydata.com/graphs/drafter/draft/f4ae3e50-f343-4785-badf-1206fe255470> pmdcat:metadataGraph barber:record .
which should also be
barber:record pmdcat:metadataGraph barber:record .
i.e. we have two triples where we should have one.
Early thoughts
First thing to note is the problem appears to occur at publication (not during the rewriting during the append).
Secondly I wonder if this is an issue with reflexive triples which are in a graph of the same name.... i.e. quads of the following form?
the:barber ?any-pred the:barber the:barber
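One way to test that hunch against a dataset is to scan for quads of that shape. A small Python sketch, assuming quads are held as (graph, subject, predicate, object) tuples; `reflexive_quads` is a hypothetical helper and the `ex:` names are placeholders.

```python
def reflexive_quads(quads):
    """Return quads whose graph, subject and object are all the same term."""
    return [q for q in quads if q[0] == q[1] == q[3]]


quads = [
    ("the:barber", "the:barber", "ex:any-pred", "the:barber"),  # reflexive
    ("the:barber", "the:barber", "ex:label", "Example"),        # subject only
    ("ex:other", "ex:s", "ex:p", "ex:o"),                       # unrelated
]
suspects = reflexive_quads(quads)
```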