ThreeSixtyGiving / datagetter

Scripts to download data from http://registry.threesixtygiving.org
MIT License

cache: If file remote name changes we get a uniqueness error #48

Closed michaelwood closed 1 year ago

michaelwood commented 1 year ago

When the remote file type changes, e.g. from .csv to .xlsx, but all of the ids stay the same, we end up trying to write a new entry in the cache for the file. This isn't possible because the cache database's constraint requires that all of the output JSON files be unique.

Note: file changed from csv to xlsx

In the cache database:

sqlite> select * from cache where json_file='a003W000007E8cEQAS.json';
a003W000007E8cEQAS.csv|0e0388205cb0be7e6eac0f661b3fd06ca408a8c8|a003W000007E8cEQAS.json

Then on the next run:

(.ve) datastore-test@360datastore:~/datastore$ datagetter.py --publishers 360G-linbury
Remove existing directory? data y/n: y

Downloading 360Giving Schema...

Schema Download successful.

Fetching https://linburytrust.org.uk/wp-content/uploads/2023/08/TheLinburyTrust_GB-CHC-287077.xlsx
Running convert on data/original/a003W000007E8cEQAS.xlsx to data/json_all/a003W000007E8cEQAS.json

Unflattening failed for file data/original/a003W000007E8cEQAS.xlsx

UNIQUE constraint failed: cache.json_file
Traceback (most recent call last):
  File "/home/datastore-test/datastore/.ve/src/datagetter/getter/get.py", line 237, in fetch_and_convert
    cache.update_cache(
  File "/home/datastore-test/datastore/.ve/src/datagetter/getter/cache.py", line 64, in update_cache
    cur.execute(
sqlite3.IntegrityError: UNIQUE constraint failed: cache.json_file
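The failure can be reproduced in isolation. The sketch below uses a hypothetical minimal version of the cache schema (the real column names are assumptions inferred from the query output above): the json_file column is derived from the stable id, so when only the remote extension changes the new row collides on it.

```python
import sqlite3

# Assumed minimal cache schema; column names inferred from the issue's
# "select * from cache" output, with a UNIQUE constraint on json_file.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cache (original_file TEXT, hash TEXT, json_file TEXT UNIQUE)"
)

# First run: the publisher serves a .csv.
con.execute(
    "INSERT INTO cache VALUES (?, ?, ?)",
    ("a003W000007E8cEQAS.csv", "0e0388205cb0be7e6eac0f661b3fd06ca408a8c8",
     "a003W000007E8cEQAS.json"),
)

# Later run: same id, but the remote file is now .xlsx. original_file
# differs, so this looks like a new row -- and it collides on json_file.
try:
    con.execute(
        "INSERT INTO cache VALUES (?, ?, ?)",
        ("a003W000007E8cEQAS.xlsx", "1f1499aabbcc",
         "a003W000007E8cEQAS.json"),
    )
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: cache.json_file
```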

The short-term fix is simply to delete the stale cache entry.
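A possible longer-term fix (a sketch, not the project's current code) is to replace the plain INSERT in update_cache with an SQLite upsert, so a changed remote extension overwrites the stale row keyed on json_file rather than raising IntegrityError. ON CONFLICT ... DO UPDATE needs SQLite 3.24+; the schema here is the same hypothetical minimal one as above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cache (original_file TEXT, hash TEXT, json_file TEXT UNIQUE)"
)
con.execute(
    "INSERT INTO cache VALUES ('a003W000007E8cEQAS.csv', 'hash1', "
    "'a003W000007E8cEQAS.json')"
)

# Upsert: on a json_file collision, update the stale row in place
# instead of failing with IntegrityError.
con.execute(
    """INSERT INTO cache (original_file, hash, json_file)
       VALUES (?, ?, ?)
       ON CONFLICT(json_file) DO UPDATE SET
           original_file = excluded.original_file,
           hash = excluded.hash""",
    ("a003W000007E8cEQAS.xlsx", "hash2", "a003W000007E8cEQAS.json"),
)
row = con.execute(
    "SELECT * FROM cache WHERE json_file='a003W000007E8cEQAS.json'"
).fetchone()
print(row)  # ('a003W000007E8cEQAS.xlsx', 'hash2', 'a003W000007E8cEQAS.json')
```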

A secondary issue is that this cache error occurs in the same part of the code that handles unflattening (which is what the cache exists for), so when an exception is raised it gets interpreted as an unflattening problem, and therefore as a problem with the validity of the data itself. Improving the error handling here would help future investigations.
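One way to improve that error handling is to give the cache update its own try block, so a cache failure is reported as a cache failure rather than being folded into the unflattening error path. This is a hypothetical restructuring; unflatten and update_cache are stand-ins for the datagetter internals, not its actual signatures.

```python
import sqlite3

def fetch_and_convert(original_file, json_file, unflatten, update_cache):
    """Convert a file, then update the cache, reporting failures separately.

    unflatten and update_cache are injected callables standing in for
    datagetter's real conversion and cache-update steps (assumptions).
    Returns True if the conversion itself succeeded.
    """
    try:
        unflatten(original_file, json_file)
    except Exception:
        # Genuine data problem: the file could not be converted.
        print(f"Unflattening failed for file {original_file}")
        return False
    try:
        update_cache(original_file, json_file)
    except sqlite3.IntegrityError as e:
        # Cache bookkeeping problem: the data converted fine, only the
        # cache row could not be written. Don't blame the data.
        print(f"Cache update failed for {json_file}: {e}")
    return True
```

With this split, the "UNIQUE constraint failed: cache.json_file" error above would be logged as a cache problem while the converted JSON is still treated as valid.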