Swirrl / datahost-prototypes

Eclipse Public License 1.0

Difftool as transaction log #255

Closed kirahowe closed 1 year ago

kirahowe commented 1 year ago

This development should happen against main and not the (soon to be) frozen ONS branch.

This issue covers Rick's updated delta tool proposal here. It is for discussion and improvement leading up to (maybe) implementing the proposal in the future.

There are likely a few issues here to break out if/when we get to a point of implementing this.

1. Accrete changes as one continuous/flat log

All file uploads ("commits") get parsed and added to a flat log of changes on a release, as opposed to grouping them (as one append, one retraction, one correction) per revision. Corrections are not necessarily added verbatim; they would be broken down into the logical operations required to apply them. For example, the middle two rows here (one delete and one append) make up a correction:

id     op      dim   dim         measure  revision  commit  correction_of
af532  DELETE  male  manchester  10       10        1
ee123  DELETE  male  liverpool   15       10        1
f1acd  APPEND  male  liverpool   23       10        1       ee123
e3754  APPEND  male  london      300      10        1
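
To make the flat-log idea concrete, here is a minimal sketch of replaying such a log to rebuild the current state of a release. It is illustrative only: the `Tx` type, the `replay` function, the choice to key rows by their dimension values, and the starting state are all assumptions, not part of the proposal.

```python
from typing import NamedTuple, Optional


class Tx(NamedTuple):
    """One row of the flat log (hypothetical shape, mirroring the table above)."""
    id: str
    op: str                      # "APPEND" or "DELETE"
    dims: tuple                  # dimension values, e.g. ("male", "liverpool")
    measure: int
    revision: int
    commit: int
    correction_of: Optional[str] = None


def replay(state, log):
    """Fold the log over a starting state of {dims: measure} and return the result.

    Assumption: a row is identified by its dimension values alone.
    """
    current = dict(state)
    for tx in log:
        if tx.op == "DELETE":
            current.pop(tx.dims, None)
        elif tx.op == "APPEND":
            current[tx.dims] = tx.measure
        else:
            raise ValueError(f"unknown op: {tx.op}")
    return current


LOG = [
    Tx("af532", "DELETE", ("male", "manchester"), 10, 10, 1),
    Tx("ee123", "DELETE", ("male", "liverpool"), 15, 10, 1),
    Tx("f1acd", "APPEND", ("male", "liverpool"), 23, 10, 1, "ee123"),
    Tx("e3754", "APPEND", ("male", "london"), 300, 10, 1),
]

# Assumed state before this commit; the issue does not show it.
BEFORE = {("male", "manchester"): 10, ("male", "liverpool"): 15}

AFTER = replay(BEFORE, LOG)
```

Note that the correction needs no special handling on replay: the DELETE and APPEND rows apply like any others, and `correction_of` only records that they belong together.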

2. Implement new routes to retrieve these transaction logs

All routes to the data will support accessing both schemas/representations of the data via content negotiation, e.g.

GET /my-data/my-release Accept: text/csv

GET /my-data/my-release Accept: application/x-datahost-tx-csv

Revisions would work the same way:

GET /my-data/my-release/revision/1 Accept: text/csv

GET /my-data/my-release/revision/1 Accept: application/x-datahost-tx-csv
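
As a rough sketch of how a handler might pick between the two representations, the function below maps an Accept header to one of them. The function name, the return labels, and the fallback to `text/csv` for a missing header or `*/*` are assumptions for illustration; q-values are ignored for brevity.

```python
# Media types named in the issue, mapped to hypothetical representation labels.
SUPPORTED = {
    "text/csv": "user-schema",
    "application/x-datahost-tx-csv": "tx-log",
}


def choose_representation(accept_header):
    """Return the representation label for an Accept header, or None if unsupported.

    Simplification: entries are tried in the order listed and q-values are ignored.
    """
    if not accept_header:
        return SUPPORTED["text/csv"]  # assumed default when no header is sent
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return SUPPORTED[media_type]
        if media_type == "*/*":
            return SUPPORTED["text/csv"]  # assumed default for wildcard
    return None  # a caller would likely answer 406 Not Acceptable
```

The same function would serve both the release and revision routes, since only the Accept header differs between the two representations.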

RickMoynihan commented 1 year ago

@kiramclean: I'm not sure 1. is framed quite right as an issue, though it's useful context on what we mean by a TX log. It's worth noting that the format you suggest assumes #253 (which we might not want to do longer term on the main branch).

Essentially 2. is how the API supports reading the TX log 👍 and we definitely need that. But in my mind we need at least two more issues: one to cover how we create that TX log by POSTing TX format data, and another for how we create that TX data/delta.

So let me propose creating the following issues:

  1. Support POST /my-data/my-release Content-Type: text/csv, Accept: application/x-datahost-tx-csv to consult the latest revision/commit and calculate the delta from the whole file provided in user schema. The return value is then in application/x-datahost-tx-csv and should contain just the subset of rows that are appended/deleted/corrected, in TX format. This will depend on deciding the revision commit/change format (for the main branch); if we're to apply this to the ONS branch, then on that branch at least it should follow the decision in #253. For main we will need to do #256 first. NOTE: we don't need a /delta slug on the route because the Accept header identifies the response as a delta.
  2. Support POST /my-data/my-release Content-Type: application/x-datahost-tx-csv for taking a delta (sequence of commits created via 1.) and appending them to the TX log.
  3. The read side (what you called issue 2.). The only point I'd make relating to those is that the GET /my-data/release routes should all redirect to the latest revisions and delegate to them to return the actual data.
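
The delta calculation in item 1 could look roughly like the sketch below, which compares the latest snapshot with a full uploaded file and emits TX rows. Everything here is an assumption made for illustration: the function names, keying rows by dimension values, and deriving ids from a truncated SHA-1 of the row content. The revision and commit columns are left out because their format is still to be decided.

```python
import hashlib


def tx_id(op, dims, measure):
    """Hypothetical id scheme: first 5 hex chars of a SHA-1 over the row content."""
    text = f"{op}|{'|'.join(dims)}|{measure}"
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:5]


def calculate_delta(latest, uploaded):
    """Return TX rows that turn `latest` into `uploaded`.

    Both arguments map a tuple of dimension values to a measure.
    A changed measure becomes a DELETE plus an APPEND whose
    correction_of points at the DELETE row's id.
    """
    delta = []
    for dims in sorted(latest.keys() | uploaded.keys()):
        old, new = latest.get(dims), uploaded.get(dims)
        if old == new:
            continue  # unchanged rows are not part of the delta
        deleted_id = None
        if old is not None:
            deleted_id = tx_id("DELETE", dims, old)
            delta.append({"id": deleted_id, "op": "DELETE", "dims": dims,
                          "measure": old, "correction_of": None})
        if new is not None:
            delta.append({"id": tx_id("APPEND", dims, new), "op": "APPEND",
                          "dims": dims, "measure": new,
                          "correction_of": deleted_id})
    return delta


latest = {("male", "manchester"): 10, ("male", "liverpool"): 15}
uploaded = {("male", "liverpool"): 23, ("male", "london"): 300}
delta = calculate_delta(latest, uploaded)
```

Under item 2, a list like `delta` (serialised as application/x-datahost-tx-csv) is what would then be POSTed back and appended to the TX log.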

I hope that makes sense. I'm not around this afternoon, but can sync up on Monday if needed.

RickMoynihan commented 1 year ago

I have broken this issue up into three as described in my previous comment.

The issues are: