Implement Efficient Publishing in Drafter

ricroberts commented 6 years ago

In order to:

make publishes more efficient, i.e. O(1) publish.
let people warm up the cache while they're drafting.

We should change drafter so that:

We keep the external API the same
We use secret named graphs (as storage graphs) for every endpoint (not just draftsets)
Rewrite on every endpoint (not just draftsets)
Endpoints relate to graph-sets
Graph-sets are mappings of named graphs to storage graphs
We avoid moving graphs around on publish (we just change a small number of triples in the state graph).
We calculate the stasher cache keys based on the contents of the graph-set, rather than the drafter/endpoint name
We deal with any changes to how conflicts happen / are resolved due to the process changing.
Migrate (or use reasoning to manage) changes to the state graph structure

As part of designing/implementing this ticket, we should also consider/design how role-based permissions would interact with this, so we know we can add it later without too much rework.

ricroberts commented 6 years ago

Suggested initial implementation:

All data in the database is stored in (randomly named) storage-graphs
endpoints (called draftsets in the current api) have a name, description etc.
Use graph-sets to collect storage graphs together into 'union graphs' used by endpoints
graph-sets also provide public names for the graphs in that union graph.
To maintain compatibility with the current API we can have a special 'live' endpoint and new endpoints can only be started/branched from Live.

Imagine a live endpoint with 3 publicly named graphs, A, B, and C:

  live (L)
   |
  ABC

The graphset for this endpoint maps named graphs to storage graphs

  A -> S1 
  B -> S2
  C -> S3

When we query the live endpoint with the public named graphs, it will rewrite the queries and results so we actually query the storage graphs.
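A minimal sketch of that rewriting step, assuming graph names appear in queries and results as `<...>` IRIs. The example.org URIs and the function names are made up for illustration, and a real rewriter would presumably work on the parsed query rather than on text:

```python
import re

# Hypothetical graph-set for the live endpoint: public graph name -> storage graph.
LIVE_GRAPHSET = {
    "http://example.org/graph/A": "http://example.org/storage/S1",
    "http://example.org/graph/B": "http://example.org/storage/S2",
    "http://example.org/graph/C": "http://example.org/storage/S3",
}

# Matches anything written as <iri>; text that is not a known graph is left alone.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")


def rewrite_iris(text, mapping):
    """Replace every <iri> in text that appears as a key in mapping."""
    return IRI_PATTERN.sub(
        lambda m: "<" + mapping.get(m.group(1), m.group(1)) + ">", text
    )


def rewrite_query(query, graphset):
    """Public graph names -> storage graphs, applied before the query is run."""
    return rewrite_iris(query, graphset)


def rewrite_result(result, graphset):
    """Storage graphs -> public graph names, applied to results on the way out."""
    inverse = {storage: public for public, storage in graphset.items()}
    return rewrite_iris(result, inverse)


query = "SELECT ?s WHERE { GRAPH <http://example.org/graph/A> { ?s ?p ?o } }"
rewritten = rewrite_query(query, LIVE_GRAPHSET)
```

The same pair of functions would work for any endpoint, since only the graph-set passed in changes.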

Scenario 1

Now imagine a scenario where someone makes a change to Graph A (e.g. by appending some data)

   L  Draftset1
   ^  ^
   |  |
   |  /
   |/ 
  ABC    A'BC

As we do in the current version, we make a copy of A at the point of appending the new data into a new storage graph.

The graphset for this draftset

  A -> S4
  B -> S2
  C -> S3

To publish the change to this draftset to live, we just update the graphset used for Live to be the above. No copying of data is required.

   L 
   ^  
   |
   A'BC
   |\  
   | \  
   |  |
   |  /
   |/ 
  ABC

After the change, Live has the changed version of Graph A.
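Scenario 1 can be sketched with a plain dictionary standing in for the state graph. The `state` layout and the `publish` function are assumptions for illustration only; the point is that publishing touches the mappings and never the graph data:

```python
# In-memory stand-in for the state graph: endpoint name -> graph-set.
state = {
    "live": {"A": "S1", "B": "S2", "C": "S3"},
    "draftset1": {"A": "S4", "B": "S2", "C": "S3"},
}


def publish(state, draft_name, target="live"):
    """Point the target endpoint at the draft's graph-set; no graph data is copied."""
    state[target] = dict(state[draft_name])
    return state[target]


publish(state, "draftset1")
```

The cost here depends on the number of mappings in the graph-set, not on how many triples the storage graphs hold.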

Scenario 2

Now imagine that this scenario is made more complicated by someone else making a change to graph B in Draftset2 shortly after Draftset 1 was created:

Draftset2    L    Draftset1
          ^  ^  ^
          |  |  | 
    AB'C   \ |  |
            \|  / A'BC
             |/ 
            ABC

The graphset for Live

  A -> S1
  B -> S2
  C -> S3

The graphset for Draftset1

  A -> S4
  B -> S2
  C -> S3

The graphset for Draftset2

  A -> S1
  B -> S5
  C -> S3

Depending on who merges/publishes first to Live, then Live will either miss the changes for Graph B, or the changes for Graph A.

There are 2 options for resolving conflicts:

OPTION 1) We could keep the client behaviour as it is now, by making graph sets inherit/cascade non-changed graph mappings from their parent endpoint's graphset (i.e. Live in our case).

The disadvantage of this is that (like now) changes can be silently inherited from live, which might break your draftset without you noticing.

This will mean that if Draftset 2 was published first:

     L    Draftset1
     ^  ^
     |  |
  AB'C  |  A'B'C
    /|  |
   / |  |
  |  |  | 
   \ |  |
    \|  /  A'BC
     |/ 
    ABC

Then Draftset 1 would inherit the change to Graph B

The graphset for Draftset1 would become:

  A -> S4
  B -> S5
  C -> S3

Then when we publish/merge/apply Draftset1 into Live, we don't lose the changes from Draftset 2:

  D2 L D1    
     ^  
     |
    A'B'C         No changes lost
     |\
     | \
     |  |
  AB'C  |  A'B'C  Changes inherited from Draftset 2
    /|  |
   / |  |
  |  |  | 
   \ |  |
    \|  /  A'BC
     |/ 
    ABC

An enhancement to option 1 would be to warn users in draftset1 that they had inherited the changes from Live, before they publish, so that they can check things.
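One way to picture Option 1 is a draft that stores only its own overrides and falls through to Live for everything else. The sketch below uses Python's `collections.ChainMap` for the cascade; `publish`, `inherited_changes` and the branch-time snapshot are hypothetical names, not anything in drafter:

```python
from collections import ChainMap

live = {"A": "S1", "B": "S2", "C": "S3"}
base_snapshot = dict(live)  # what Live looked like when the drafts were branched

# Each draftset records only the graphs it changed; the rest cascades from live.
draftset1 = ChainMap({"A": "S4"}, live)
draftset2 = ChainMap({"B": "S5"}, live)


def publish(draft, parent):
    """Apply only the draft's own overrides to the parent graph-set, in place."""
    parent.update(draft.maps[0])


def inherited_changes(draft, snapshot):
    """Graphs the draft did not touch whose mapping in the parent has since moved."""
    overrides, parent = draft.maps[0], draft.maps[1]
    return {g for g in parent if g not in overrides and parent[g] != snapshot.get(g)}


publish(draftset2, live)                              # Draftset 2 goes first
seen_by_draftset1 = dict(draftset1)                   # B now resolves to S5
to_warn_about = inherited_changes(draftset1, base_snapshot)
publish(draftset1, live)                              # nothing from Draftset 2 is lost
```

`to_warn_about` is the information the suggested enhancement would surface to the user before they publish.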

OPTION 2)

Publishing a draftset (merging an endpoint) means that all the graph mappings in the graph-set are applied to Live.
When users are working in a draftset/endpoint branched off Live, warn them of changes made to the parent/target endpoint (Live in our case), that would be lost if this draftset were to be published.
As an MVP, we could just warn users who can manually resolve these conflicts. This could just happen via a banner/alert in the admin panel, and we could list the 'conflicts' on the draftset page.
As an enhancement we could provide tools/options to copy changed graphs from the parent endpoint's graph-set into theirs.
We thought about conflict notification, in the past. See: https://github.com/Swirrl/drafter/blob/master/doc/conflict-resolution.org#proposal-conflict-notifications

With MVP:

Conflict:

     L    Draftset1
     ^  ^
     |  |
  AB'C  |  A'BC (warn!)
    /|  |
   / |  |
  |  |  | 
   \ |  |
    \|  /  A'BC
     |/ 
    ABC

Publish:

     L    Draftset1
     ^  
     |
     A'BC     Lose changes to B! (unless user manually fixes up their draftset)
     |\
     | \
     |  |
  AB'C  |  A'BC (warn!)
    /|  |
   / |  |
  |  |  | 
   \ |  |
    \|  /  A'BC
     |/ 
    ABC

With Enhancement

Conflict:

     L    Draftset1
     ^  ^
     |  |
     | /|  A'B'C (user chooses to copy changes to B from Live)
     |/ |
  AB'C  |  A'BC  (warn!)
    /|  |
   / |  |
  |  |  | 
   \ |  |
    \|  /  A'BC
     |/ 
    ABC

Publish:

     L    Draftset1
     ^  
     |
    A'B'C  Merged! :)
     |\     
     | \
     | /|  A'B'C (user chooses to copy changes to B from Live)
     |/ |
  AB'C  |  A'BC (warn!)
    /|  |
   / |  |
  |  |  | 
   \ |  |
    \|  /  A'BC
     |/ 
    ABC
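The Option 2 flow in the diagrams above (warn, optionally copy from Live, then publish everything) could be sketched roughly as follows. It assumes each draft remembers the graph-set it was branched from; all function names are hypothetical, and graphs deleted from the parent are not handled:

```python
def branch(parent):
    """Start a draft as a full copy of the parent's graph-set, remembering the base."""
    return {"base": dict(parent), "graphset": dict(parent)}


def conflicts(draft, parent):
    """Graphs changed in the parent since branching that publishing would overwrite."""
    base, mine = draft["base"], draft["graphset"]
    return sorted(
        g for g in parent if parent[g] != base.get(g) and mine.get(g) != parent[g]
    )


def take_from_parent(draft, parent, graph):
    """The 'enhancement': copy the parent's current mapping for one graph."""
    draft["graphset"][graph] = parent[graph]


def publish(draft, parent):
    """Option 2 publish: every mapping in the draft's graph-set is applied."""
    parent.clear()
    parent.update(draft["graphset"])


live = {"A": "S1", "B": "S2", "C": "S3"}
draftset1, draftset2 = branch(live), branch(live)
draftset1["graphset"]["A"] = "S4"
draftset2["graphset"]["B"] = "S5"

publish(draftset2, live)
warning = conflicts(draftset1, live)       # the banner/alert would list these
take_from_parent(draftset1, live, "B")     # user chooses to copy B from Live
after_fix = conflicts(draftset1, live)
publish(draftset1, live)
```

Skipping the `take_from_parent` call reproduces the MVP case, where publishing Draftset1 leaves Live at A'BC and the change to B is lost.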

Option 1 (especially without the enhancement) requires no changes to the API (or data returned), or the clients (PMD).

Overall, I think I prefer option 2, as it's more predictable. But it's a change (we could maybe do this at a later date?)

TODO: figure out how we model this all in RDF. What history / audit trail do we want to keep? How do we garbage collect unused storage graphs?
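On the garbage collection question, one possible starting point (an assumption, not a decided design) is to treat a storage graph as unused once no graph-set refers to it. If an audit trail keeps old graph-sets around, those would count as references too:

```python
def unused_storage_graphs(all_storage_graphs, graphsets):
    """Storage graphs that no graph-set maps to; candidates for deletion."""
    referenced = set()
    for graphset in graphsets.values():
        referenced.update(graphset.values())
    return set(all_storage_graphs) - referenced


# State after Scenario 1 has been published and the draftset discarded.
graphsets = {"live": {"A": "S4", "B": "S2", "C": "S3"}}
garbage = unused_storage_graphs(["S1", "S2", "S3", "S4"], graphsets)
```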

RickMoynihan commented 6 years ago

Firstly the write up looks great. +1 for the ascii art branch diagrams! :-)

There is a small problem with this statement on merge semantics:

Then when we publish/merge/apply Draftset1 into Live, we don't lose the changes from Draftset 2:

"changes" also means being clear about the handling of DELETEs, and I don't think you've considered this. I think you meant the weaker statement "we don't lose the APPENDs from Draftset 2".

Specifically I think the MVP we've been describing has just been a merge strategy of "all theirs, all ours, or UNIONing the graph" on a graph by graph basis. I think for an MVP this is ok, so long as users understand that DELETEs will get stomped, as we have no record of them.
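To make the "DELETEs get stomped" point concrete, here is a small sketch of the three per-graph strategies, treating a graph as a set of triples. The `merge_graph` function and the triple values are illustrative only:

```python
def merge_graph(ours, theirs, strategy):
    """Merge two versions of one graph using a whole-graph strategy."""
    if strategy == "ours":
        return set(ours)
    if strategy == "theirs":
        return set(theirs)
    if strategy == "union":
        return set(ours) | set(theirs)
    raise ValueError("unknown strategy: " + strategy)


t1, t2, t3 = ("s", "p", "o1"), ("s", "p", "o2"), ("s", "p", "o3")
base = {t1, t2}
ours = base - {t1}     # we deleted t1
theirs = base | {t3}   # they appended t3

union = merge_graph(ours, theirs, "union")
```

With `union` the deleted triple `t1` comes back, because the merge only sees the two end states and has no record that a delete happened. With `ours` the append of `t3` is lost, and with `theirs` the delete is lost.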

A more complete handling of conflict involves storing the sequence of APPEND/DELETE operations inside an RDF Patch/Delta, and letting users resolve the conflict by specifying the order of these operations at merge time. Once we know the order of changes we can offer various levels of merge granularity, providing much more precise mechanisms for merging.
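As a rough illustration of why the order matters, the sketch below replays logged operations against a set of triples. It is not the RDF Patch format itself; the `("APPEND" | "DELETE", triple)` log shape and `apply_patch` are assumptions made for the example:

```python
def apply_patch(triples, operations):
    """Replay an ordered log of APPEND/DELETE operations over a set of triples."""
    result = set(triples)
    for op, triple in operations:
        if op == "APPEND":
            result.add(triple)
        elif op == "DELETE":
            result.discard(triple)
        else:
            raise ValueError("unknown operation: " + op)
    return result


t1, t2 = ("s", "p", "o1"), ("s", "p", "o2")
base = {t1}
draft1_log = [("DELETE", t1)]
draft2_log = [("APPEND", t1), ("APPEND", t2)]

draft1_then_draft2 = apply_patch(base, draft1_log + draft2_log)
draft2_then_draft1 = apply_patch(base, draft2_log + draft1_log)
```

The two orders give different graphs: replaying Draftset 1 first keeps `t1`, while replaying it last removes `t1`. That choice is what a user would be making at merge time.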

My proposal would be to leave the RDF patch/log implementation till later, but it's a feature that I think would unlock a lot of future capabilities, including improving our HA story.

ricroberts commented 6 years ago

Yeah, I didn't consider deletes specifically, but I think if we're doing it at graph (or whole-endpoint) granularity then it will just work.

But yeah, agree we should keep it simple in v1 and prob just stick with simulating the current behaviour.

RickMoynihan commented 6 years ago

See here for the proposed data model as an example trig file and here for the supporting vocabulary

ricroberts commented 3 years ago

I had a thought about this recently. And this might be obvious to others but we could achieve most of the benefits here by making changes so that:

'live' is still a special endpoint used for the public site
use 'storage graphs' for all draftsets/endpoints (including live)
all endpoints (including live) use some mappings in the drafter state graph to relate public graph names to the storage graph names.
On publish, mutate the mapping for the live endpoint so that we now have a different set of published graphs. (Note that will simulate the current behaviour of MOVE-ing data from the draft graphs into the live graphs and therefore have the same merge behaviour of 'last change wins, per graph').

This is the minimal change we could make which would let us do instant publishing.

Added extras (not strictly required for instant publishing)

If we use the set of storage graphs in an endpoint as the cache key, this would let you warm up the cache in a draft.
Add a way to preview what a set of storage graphs together would look like (i.e. test out what merging multiple draft sets might look like). This might just mean making a new draftset/endpoint with the contents, a bit like an 'integration branch'.
Conflict detection (related to the previous bullet): detecting which drafts affect the same graph, or whether someone else has recently published a change to a graph that you've also changed.
Adding a way to allow multiple users read-only access to a draftset.
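The cache-key idea in the first of these extras could look something like the sketch below. It assumes that a storage graph's contents never change without it getting a new name, and that hashing the storage graph names is enough; `cache_key` is a hypothetical helper, not stasher's API:

```python
import hashlib


def cache_key(graphset):
    """Derive a key from the set of storage graphs, ignoring the endpoint name."""
    storage_graphs = sorted(set(graphset.values()))
    digest = hashlib.sha256("\n".join(storage_graphs).encode("utf-8"))
    return digest.hexdigest()


live_before = {"A": "S1", "B": "S2", "C": "S3"}
draftset1 = {"A": "S4", "B": "S2", "C": "S3"}
live_after = dict(draftset1)  # live once draftset1 has been published

warmed_in_draft = cache_key(draftset1)
used_by_live = cache_key(live_after)
```

Because the key depends only on which storage graphs are in play, anything cached while querying the draft would still be valid for Live straight after publishing.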
