Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

ETL Process only that data which has changed since the last run #17

Closed Robsteranium closed 2 years ago

Robsteranium commented 3 years ago

Instead of processing all of the data each time, we might want to update only those resources that have changed.

We could look to run this on, e.g., a daily basis, or somehow hook into drafter/Jenkins to trigger re-loads.

We'd need to decide the granularity (e.g. graph, dataset or individual resources) and some way of tracking change (e.g. _meta fields on each document, a "graphs" index, etc.).

Robsteranium commented 3 years ago

Drafter's graph-modified-times graph tells us when each graph was last modified.

If we record the time when the pipeline was run (e.g. in another index), we can run a SPARQL query against the graph-modified-times graph to find which public graphs have changed since the last run. We can then pass these as named-graph constraints to queries (subject-pages and construct) in the ETL pipeline.
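A minimal sketch of the change-detection query described above. The graph URI and the dcterms:modified predicate are assumptions for illustration; the actual names used by drafter's graph-modified-times graph would need checking.

```python
def modified_graphs_query(last_run_iso):
    """Build a SPARQL query for public graphs modified since the last pipeline run.

    NOTE: the modified-times graph URI and predicate below are hypothetical
    placeholders, not confirmed drafter identifiers.
    """
    return f"""
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?graph WHERE {{
  GRAPH <http://publishmydata.com/graphs/drafter/graph-modified-times> {{
    ?graph dcterms:modified ?modified .
  }}
  FILTER (?modified > "{last_run_iso}"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
}}
"""

query = modified_graphs_query("2021-03-01T00:00:00Z")
```

The resulting bindings for `?graph` would then be injected as `FROM NAMED` / `GRAPH` constraints on the pipeline's construct queries.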

The semantics of the Elasticsearch loading mean this would give us an upsert (i.e. if a document with the given @id exists then update it, otherwise create it).
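To make the upsert point concrete: indexing a document with an explicit `_id` creates it if absent and replaces it otherwise, so re-running the load over the same @ids never duplicates documents. A sketch of building bulk actions on that basis (index name and document shape are illustrative, not the pipeline's actual ones):

```python
def to_bulk_action(doc, index="resources"):
    """Turn a JSON-LD style document into an Elasticsearch bulk 'index' action."""
    return {
        "_op_type": "index",   # "index" means create-or-replace: upsert semantics
        "_index": index,
        "_id": doc["@id"],     # stable id, so re-runs overwrite rather than duplicate
        "_source": doc,
    }

actions = [to_bulk_action({"@id": "http://example.org/code/1", "label": "One"})]
```

Such a list of actions would typically be fed to the bulk helper of the Elasticsearch client.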

In order to propagate resource deletions, we will need to track the graph for each resource. Instead of an upsert, the sync process would delete modified graphs and insert their contents. Drafter will record the modified times for deleted graphs indefinitely, so this ensures we too will delete those documents.

This will lead to unnecessary deletion/re-insertion of unchanged resources, but it obviates the need to compare the index with the triple store.
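The delete half of that sync step could be a single delete-by-query over the modified graphs, assuming each document carries a field naming its source graph (the field name here is an assumption following the ook:graph idea below):

```python
def delete_by_graphs_body(graph_uris):
    """Build a delete_by_query body removing every document from the given graphs.

    NOTE: "graph" is a hypothetical field name for the document's source graph.
    """
    return {"query": {"terms": {"graph": graph_uris}}}

body = delete_by_graphs_body(["http://example.org/graph/trade-cube"])
```

After the deletion, the ETL would re-run its construct queries constrained to those same graphs and bulk-insert the results.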


For observations, codes, and components it should be safe to assume that one graph applies to the whole resource - i.e. we can wrap each query in a GRAPH ?g { } clause and attach the graph to the object with ook:graph (or possibly using json-ld's @graph property).
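A sketch of that wrapping, building a construct query that annotates each resource with the graph it was found in. The ook: prefix URI and the trivial `?s ?p ?o` pattern are placeholders for illustration:

```python
def graph_scoped_construct(pattern):
    """Wrap a triple pattern in GRAPH ?g and attach the graph to each subject.

    NOTE: the ook: namespace URI below is a hypothetical placeholder.
    """
    return f"""
PREFIX ook: <http://example.org/def/ook/>

CONSTRUCT {{
  ?s ?p ?o .
  ?s ook:graph ?g .
}} WHERE {{
  GRAPH ?g {{ {pattern} }}
}}
"""

q = graph_scoped_construct("?s ?p ?o .")
```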

For datasets, there will be two graphs - one for the entry and another for the cube. It would be possible to modify these independently (in most cases jenkins will probably modify both at once but we can't guarantee this, indeed certainly not for PMD admin). Instead of letting ES treat these as a vector property (i.e. which wouldn't distinguish which triples came from which graph) we should probably reorganise the dataset documents.

Option 1: Rather than lifting the cube's "components" property up to the entry, we could create two documents instead (either in their own index or as two types in one index) - one for the entry and another for the cube. The former would simply point at the latter. This would make syncing easy (i.e. the same as for the other indices) but would require a join when constructing dataset objects.
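Under option 1, the join when constructing a dataset object is a simple lookup-and-merge. A sketch with illustrative field names (entry/cube shapes here are assumptions, not the pipeline's actual document schema):

```python
def join_dataset(entry, cubes_by_id):
    """Combine an entry document with the cube document it references."""
    cube = cubes_by_id[entry["cube"]]
    return {**entry, "component": cube.get("component", [])}

entry = {"@id": "http://example.org/dataset/trade",
         "@graph": "http://example.org/graph/trade-metadata",
         "cube": "http://example.org/cube/trade"}
cube = {"@id": "http://example.org/cube/trade",
        "@graph": "http://example.org/graph/trade-cube",
        "component": ["http://example.org/dim/year"]}

dataset = join_dataset(entry, {cube["@id"]: cube})
```

Each document keeps its own @graph, so the sync can delete and reinsert either independently.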

Option 2: A more Elasticsearch-friendly approach would be to nest the cube inside the entry, with the outer and inner document having different @graph properties. This would obviate the need to join them. The deletion for entries would need to leave the sub-document intact; likewise for cubes it shouldn't delete the super-document. Where both graphs were affected we'd need to ensure the whole document - entry and cube - was removed (and not have remnants caught with reciprocal dependencies!). Similarly, the insertion for either would need to find the right document to update. Indeed, worse still, it might need to change the @id if the dataset or entry's URI changed! We'd have to look the document up by its @graph (which itself would need to be excluded from deletion). This would of course be problematic where more than one document shared a graph, but should be ok for datasets (which ought to be 1:n with graph - i.e. a cube can be split across graphs but a graph shouldn't contain more than one cube or entry).
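The partial-deletion rules for option 2 can be sketched as a pure function over the nested document and the set of modified graphs (document shape is illustrative; it shows why this option carries more bookkeeping than option 1):

```python
def prune_for_graphs(doc, modified_graphs):
    """Return the surviving part of a nested entry+cube document,
    or None if both halves were in modified graphs."""
    entry_hit = doc["@graph"] in modified_graphs
    cube_hit = doc.get("cube", {}).get("@graph") in modified_graphs
    if entry_hit and cube_hit:
        return None                                           # remove whole document
    if cube_hit:
        return {k: v for k, v in doc.items() if k != "cube"}  # keep entry only
    if entry_hit:
        return doc.get("cube")                                # keep cube sub-document only
    return doc                                                # untouched

doc = {"@id": "http://example.org/dataset/trade",
       "@graph": "http://example.org/graph/trade-metadata",
       "cube": {"@graph": "http://example.org/graph/trade-cube"}}
```

The subsequent insert would then have to find whichever fragment survived and reattach the fresh half - the lookup-by-@graph problem described above.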

Given the fact that the number of datasets is relatively small (less than 150 on staging at the moment, only 15 for trade), I'm inclined to go with option 1 - make the updates easier at the expense of needing some trivial-to-compute joins.

Robsteranium commented 3 years ago

It looks like the drafter feature is nearly ready but won't land in time for us to start building on it ahead of the end-of-March deadline.

We can return to this later.