Swirrl / drafter

A clojure service and a client to it for exposing data management operations to PMD
Other
0 stars 0 forks source link

Perfect Rewriting Redux: Rewriting not quite perfect in some publishing scenarios #618

Open RickMoynihan opened 2 years ago

RickMoynihan commented 2 years ago

@lkitching's recent PR #614 improves rewriting data on the way in such that it resolves #607.

However during the review of the #614 PR it was noted there was an additional unconsidered edge case. This edge case is a problem that already existed prior to #614, and has been shown to exist after #614 too, though #614 does fix the initial reported issue and is deemed overall to be an important behavioural fix.

Hypothetical scenario where problem can occur

Imagine users are maintaining for example a dcat catalog of datasets. (NOTE the example here is outside of muttnik's supported dcat profile; but this is drafter so strictly speaking that profile isn't our concern, and it's just a simplified example (omitting dcat:Record), the same issue could happen inside muttnik's AP too).

Imagine the following state is already published to the live site via drafter. NOTE that the imagined pattern in use here is that each datasets data is published into a graph which shares the same URI as the dataset URI, and at least one other graph (here the catalog) contains references to those URIs:

@prefix dcat: <http://www.w3.org/ns/dcat#> .

</my-catalogue/graph> { 
  </my-catalogue> a dcat:Catalog .

  </my-catalogue> dcat:dataset </datasets/one> .   
  </my-catalogue> dcat:dataset </datasets/two> .
  </my-catalogue> dcat:dataset </datasets/three> .
  # ...
  </my-catalogue> dcat:dataset </datasets/one-thousand> .
  # ...
} 

</datasets/one> {
  </obs/one> qb:dataSet </datasets/one>  .
  # ...
}

</datasets/two> {
  </obs/two> qb:dataSet </datasets/two>  .
  # ...
}

</datasets/three> {
  </obs/three> qb:dataSet </datasets/three>  .
  # ...
}

</datasets/one-thousand> {
  </obs/one-thousand> qb:dataSet </datasets/one-thousand>  .
  # ...
}

Next assume a user wants to add some more data to one of the datasets (/datasets/three). They create a new draftset and providing the following quad:

</datasets/three> {
  </obs/new-triple> qb:dataSet </datasets/three>  .
  # ...
}

This results in the </datasets/three> graph being cloned into the draftset, and the quads being rewritten behind the scenes to something like:

(TODO replace this with what we really get)

</draft/graph/id/for/datasets/three> {
  </obs/new-triple> qb:dataSet </draft/graph/id/for/datasets/three>  .
  </obs/three> qb:dataSet </draft/graph/id/for/datasets/three>  .
  # ...
}

Now there is a problem...

The problem is that our current implementation only rewrites graphs which contain real changes; and doesn't rewrite data/graphs which have no changes in the draftset.

i.e. the following SPARQL query against this draftset (when run with ?union-with-live=true) would fail:

ASK { 
   ?catalog dcat:catalog </datasets/three>
}

Because we haven't also rewritten the catalog graph to include this statement:

</my-catalogue/graph> { 
  </my-catalogue> dcat:dataset </draft/graph/id/for/datasets/three> . # object is this `</datasets/three>` instead.
  #...
}

Suggested solution

For each batch of data we have loaded into the draftset, we should inspect the statements we loaded to see if they mention any graphs for which we have copies in the draftset e.g.

PREFIX stategraph: <http://publishmydata.com/graphs/drafter/drafts>
PREFIX drafter: <http://publishmydata.com/def/drafter/> .

SELECT DISTINCT ?g { 
     stategraph: { 
        BIND(<http://current/draftset> ?current_ds)
        ?g drafter:inDraftSet ?current_ds .
     } 
     OPTIONAL { graph ?g { ?g ?p ?o } }
     OPTIONAL { graph ?g { ?s ?g ?o } }      
     OPTIONAL { graph ?g { ?s ?p ?g } }
} 

Each graph ?g that is identified should then essentially be cloned into the draftset too; in order to ensure that when within the draft, the joins can happen properly.

The problem with this solution

The main problem with this approach is that graphs which the user does not have real changes against, are in the draftset and will be published. This increases the chances of stomping on other changes which have happened between updates e.g. in the following scenario where two users have drafts:

Improved solution

So the proposed solution is to update drafter to distinguish between "meaningful" changes provided by users, and changes that only exist to preserve the draftset illusion (within the draft). When we clone graphs into the draftset and rewrite them solely because they contain references to graphs, we should record these draft graphs as drafter:hasUserChanges false. When a user later makes changes to those graphs we should update that flag to true, and potentially drop and re-copy that draft graph from live again (rewriting it as we do). The reason we may wish to drop & replace that draft graph is to ensure the in the case of old changes we're not layering changes ontop of old data.