One of the longest processes we have is flattening very large spreadsheets. However, these spreadsheets often do not change on a daily basis. We still need to run the data through the pipeline as if it were new each day because we want to apply `additional_data` to the grant data, and `additional_data` can change over time, so it is important to re-process this daily.
How I'd envisage this working:
1. The datagetter GETs the spreadsheet (via the link provided by the registry, existing code).
2. An MD5 sum of the spreadsheet is generated.
3. The MD5 sum is looked up in some persistent key:value store (sqlite? JSON file? etc.).
4. IF the key is found, the value gives the file location of the cached JSON (unflattened) version of the spreadsheet and we copy that file to the output location. ELSE we unflatten the spreadsheet, copy the result to both the output location and the cache location, and save the key:value pair to the store (sketched below).
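A minimal sketch of that flow, assuming a sqlite file next to a cache directory holding the JSON copies. The `CACHE_DIR`/`CACHE_DB` paths and the `md5sum`/`get_json_for_spreadsheet` names are illustrative, and the existing flatten-tool step is passed in as an `unflatten` callable rather than guessed at:

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path

CACHE_DIR = Path("cache")              # assumed cache location
CACHE_DB = CACHE_DIR / "cache.sqlite"  # assumed key:value store


def md5sum(path):
    """MD5 of the downloaded spreadsheet, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def get_json_for_spreadsheet(spreadsheet_path, output_path, unflatten):
    """unflatten is the existing spreadsheet -> JSON step, passed in as a
    callable taking (input_path, output_path)."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = md5sum(spreadsheet_path)
    conn = sqlite3.connect(CACHE_DB)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (md5 TEXT PRIMARY KEY, json_path TEXT)"
        )
        row = conn.execute(
            "SELECT json_path FROM cache WHERE md5 = ?", (key,)
        ).fetchone()
        if row:
            # Cache hit: reuse the previously unflattened JSON.
            shutil.copy(row[0], output_path)
            return
        # Cache miss: unflatten, keep a copy in the cache, record the mapping.
        unflatten(spreadsheet_path, output_path)
        cached_copy = CACHE_DIR / ("%s.json" % key)
        shutil.copy(output_path, cached_copy)
        conn.execute(
            "INSERT OR REPLACE INTO cache (md5, json_path) VALUES (?, ?)",
            (key, str(cached_copy)),
        )
        conn.commit()
    finally:
        conn.close()
```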
Challenges to this are:
- The process needs to be thread-safe, as each GET/unflatten currently happens in a separate thread (see the sketch after this list).
- The output needs to continue to have the same folder structure, to stay compatible with the datastore loader.
- We will probably end up with orphaned key:value pairs. Not a huge problem as they're very small, but a clean-up process could be thought about, e.g. remove all keys which weren't accessed during the latest run (see the same sketch below).
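A rough sketch of both concerns, assuming the cache table above gains a `last_accessed` column and that each run is tagged with a `run_id` (both hypothetical). Worker threads open their own sqlite connection and serialise writes behind a shared lock, and `purge_orphans` drops anything the latest run never touched:

```python
import sqlite3
import threading
from pathlib import Path

CACHE_LOCK = threading.Lock()


def touch_key(db_path, key, run_id):
    """Record that a cache key was used in this run (writes serialised by a shared lock)."""
    with CACHE_LOCK:
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(
                "UPDATE cache SET last_accessed = ? WHERE md5 = ?", (run_id, key)
            )
            conn.commit()
        finally:
            conn.close()


def purge_orphans(db_path, run_id):
    """After a run, drop keys (and their cached JSON) that the run never touched."""
    conn = sqlite3.connect(db_path)
    try:
        stale = conn.execute(
            "SELECT md5, json_path FROM cache "
            "WHERE last_accessed IS NULL OR last_accessed != ?",
            (run_id,),
        ).fetchall()
        for md5, json_path in stale:
            Path(json_path).unlink(missing_ok=True)
            conn.execute("DELETE FROM cache WHERE md5 = ?", (md5,))
        conn.commit()
    finally:
        conn.close()
```

A plain JSON file would also work as the store, but sqlite gives atomic writes for free, which matters once several threads are updating it.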
Benefits:
- Huge speed improvement once the initial load has been done; we'll only end up processing files which have changed or are new.
Related: https://github.com/ThreeSixtyGiving/datastore/issues/105