ThreeSixtyGiving / datagetter

Scripts to download data from http://registry.threesixtygiving.org
MIT License

Add flattened data cache #43

Closed · michaelwood closed 1 year ago

michaelwood commented 1 year ago

One of the longest processes we have is flattening very large spreadsheets. However, these spreadsheets often do not change on a daily basis. We still need to run the data through the pipeline from scratch each day because we want to apply additional_data to the grant data; additional_data can change over time, so it is important to re-process it daily.

How I'd envisage this working:

  1. The datagetter GETs the spreadsheet (via the link provided by the registry, existing code)
  2. An MD5 sum of the spreadsheet is generated.
  3. The MD5 sum is looked up in some persistent key:value store (sqlite? a JSON file? etc.)
  4. IF the key is found, its value gives the file location of the JSON (unflattened) version of the spreadsheet, and we copy that file to the output location. ELSE we unflatten the spreadsheet, copy the result to both the output location and a cache location, and save the key:value pair to the store (see the sketch after this list).
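A minimal sketch of that flow, assuming a JSON file as the key:value store and a hypothetical `unflatten` callable standing in for the existing flatten-tool step; the paths and names are illustrative, not the datagetter's actual API:

```python
import hashlib
import json
import shutil
from pathlib import Path

CACHE_DIR = Path("cache")                 # illustrative cache location
CACHE_INDEX = CACHE_DIR / "index.json"    # persistent key:value store (md5 -> cached JSON path)


def md5sum(path):
    """Step 2: generate an MD5 sum of the downloaded spreadsheet."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def load_index():
    return json.loads(CACHE_INDEX.read_text()) if CACHE_INDEX.exists() else {}


def save_index(index):
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    CACHE_INDEX.write_text(json.dumps(index))


def get_unflattened_json(spreadsheet_path, output_path, unflatten):
    """Steps 3-4: reuse the cached JSON if the spreadsheet is unchanged,
    otherwise unflatten it and populate the cache.

    `unflatten` is a hypothetical callable standing in for the existing
    flatten-tool step: it takes (input_path, output_path).
    """
    index = load_index()
    key = md5sum(spreadsheet_path)

    if key in index:
        # Cache hit: copy the previously unflattened JSON to the output location.
        shutil.copy(index[key], output_path)
        return

    # Cache miss: unflatten, then keep a copy in the cache and record the key.
    unflatten(spreadsheet_path, output_path)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached_path = CACHE_DIR / f"{key}.json"
    shutil.copy(output_path, cached_path)
    index[key] = str(cached_path)
    save_index(index)
```

Note that only the unflattened JSON is cached here, so additional_data would still be applied to the output each day, as described above.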

Challenges to this are:

Benefits:

Related: https://github.com/ThreeSixtyGiving/datastore/issues/105

michaelwood commented 1 year ago

Possibly going to be looked at by @codemacabre

michaelwood commented 1 year ago

@mariongalley Looks like this feature has halved the datastore's processing time.

mariongalley commented 1 year ago

@michaelwood WOAH - well done team!