@lkitching's recent PR #614 improves rewriting data on the way in, such that it resolves #607.
However, during review of #614 an additional, previously unconsidered edge case was noted. This edge case already existed prior to #614 and has been shown to still exist after it; nevertheless #614 fixes the initially reported issue and is considered an important behavioural fix overall.
Hypothetical scenario where the problem can occur
Imagine users are maintaining, for example, a dcat catalog of datasets. (NOTE: the example here is outside muttnik's supported dcat profile, but this is drafter, so strictly speaking that profile isn't our concern; it's just a simplified example (omitting dcat:Record), and the same issue could happen inside muttnik's AP too.)
Imagine the following state is already published to the live site via drafter. NOTE that the pattern imagined here is that each dataset's data is published into a graph which shares the same URI as the dataset URI, and at least one other graph (here the catalog) contains references to those URIs:
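As a rough illustration (the URIs here are made up for this example, reconstructed from the ones used later in this issue), such a published state might look like:

```trig
@prefix dcat: <http://www.w3.org/ns/dcat#> .

# The catalogue graph references each dataset URI...
</my-catalogue/graph> {
  </my-catalogue> a dcat:Catalog ;
    dcat:dataset </datasets/one>, </datasets/two>, </datasets/three> .
}

# ...and each dataset's data lives in a graph sharing the dataset's URI.
</datasets/three> {
  </datasets/three> a dcat:Dataset .
  # ...dataset contents...
}
```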
Next assume a user wants to add some more data to one of the datasets (</datasets/three>). They create a new draftset and provide the following quad:
This results in the </datasets/three> graph being cloned into the draftset, and the quads being rewritten behind the scenes to something like: (TODO replace this with what we really get)
Now there is a problem...
The problem is that our current implementation only rewrites graphs which contain real changes; it doesn't rewrite data/graphs which have no changes in the draftset.
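For concreteness, here is a hedged sketch of what the user might add and what drafter might store after rewriting (the draft graph URI and the exact rewritten form are illustrative assumptions, not drafter's confirmed behaviour):

```trig
@prefix dcat: <http://www.w3.org/ns/dcat#> .

# What the user submits to their draftset:
</datasets/three> {
  </datasets/three> dcat:keyword "energy" .
}

# What drafter might store behind the scenes: the graph is cloned into a
# draft graph, and references to </datasets/three> are rewritten to it.
</draft/graph/id/for/datasets/three> {
  </draft/graph/id/for/datasets/three> dcat:keyword "energy" .
  # ...plus the existing live contents of </datasets/three>, rewritten
}
```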
i.e. the following SPARQL query against this draftset (when run with ?union-with-live=true) would fail:
PREFIX dcat: <http://www.w3.org/ns/dcat#>
ASK {
  ?catalog dcat:dataset </datasets/three>
}
Because we haven't also rewritten the catalog graph to include this statement:
</my-catalogue/graph> {
</my-catalogue> dcat:dataset </draft/graph/id/for/datasets/three> . # object rewritten to the draft graph URI instead of `</datasets/three>`
#...
}
Suggested solution
For each batch of data loaded into the draftset, we should inspect the loaded statements to see whether they mention any graphs for which we have copies in the draftset, e.g.
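A hedged sketch of that inspection as a SPARQL query (the drafter namespace, state graph URI, and property names are assumptions about drafter's state schema, not confirmed):

```sparql
PREFIX drafter: <http://publishmydata.com/def/drafter/>

# For the batch just loaded, find URIs mentioned as objects which name
# live graphs that already have a draft copy in this draftset.
# (Subjects could be checked in the same way.)
SELECT DISTINCT ?g WHERE {
  GRAPH ?batch { ?s ?p ?g }            # a URI mentioned in the batch...
  FILTER(isIRI(?g))
  GRAPH <drafter-state-graph> {        # placeholder state graph
    ?draft drafter:inDraftset ?draftset ;
           drafter:forLiveGraph ?g .   # ...naming a graph we hold a draft of
  }
}
```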
Each graph ?g identified in this way should then essentially be cloned into the draftset too, to ensure that joins can still happen properly within the draft.
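The cloning step itself could be a plain SPARQL 1.1 Update COPY, followed by the reference-rewriting pass (the draft graph URI is illustrative):

```sparql
# Clone the shared catalogue graph into the draftset; the references
# inside the copy can then be rewritten to draft graph URIs.
COPY GRAPH </my-catalogue/graph> TO GRAPH </draft/graph/id/for/my-catalogue>
```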
The problem with this solution
The main problem with this approach is that graphs against which the user has no real changes end up in the draftset and will be published. This increases the chance of stomping on other changes which have happened between updates, e.g. in the following scenario where two users have drafts:
T1: user1 adds a triple to the <datasets/three> graph. This copies that graph into Draft1 with the addition in it, whilst also cloning the shared catalogue graph in order to rewrite the references there. NOTE the user has no real changes to the catalogue graph; it is cloned only to preserve the ability to join over that relationship.
T2: user2 adds a new dataset graph <datasets/new-dataset> and registers it in the catalogue. Draft2 now contains a draft of the new-dataset graph, and a draft of the catalogue graph with a new triple registering the new dataset with the catalog.
T3: user2 publishes their draft Draft2; things look fine.
T4: user1 publishes their draft Draft1; the change to the catalog made at T2 is erroneously reverted (Draft1's stale catalogue clone overwrites it), though the graph containing <datasets/new-dataset> itself still exists correctly.
Improved solution
So the proposed solution is to update drafter to distinguish between "meaningful" changes provided by users, and changes that exist only to preserve the draftset illusion (within the draft). When we clone graphs into the draftset and rewrite them solely because they contain references to changed graphs, we should record those draft graphs as drafter:hasUserChanges false. When a user later makes real changes to such a graph, we should update the flag to true, and potentially drop and re-copy that draft graph from live again (rewriting it as we do). The reason we may wish to drop & replace the draft graph is to ensure that, when the clone is old, we're not layering the user's changes on top of stale data.
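Recording the flag might look something like this (the state graph URI is a placeholder and the drafter namespace an assumption; the exact modelling would follow drafter's real state schema):

```sparql
PREFIX drafter: <http://publishmydata.com/def/drafter/>

# Mark the catalogue clone as holding no user changes; flip this to true
# (or drop and re-copy the graph) once the user edits it directly.
INSERT DATA {
  GRAPH <drafter-state-graph> {
    </draft/graph/id/for/my-catalogue> drafter:hasUserChanges false .
  }
}
```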