hyanwong opened this issue 4 days ago
It would probably be quite easy to write a new script that compares two filtered dumps. I can look into that next week.
Side note: make sure you pull the latest before creating a filtered dump, as I made recent changes to include a few more things that we now need.
> It would probably be quite easy to write a new script that compares two filtered dumps. I can look into that next week.
Thanks. At the moment the `generate_filtered_files` script creates a file called e.g. `OneZoom_latest-all.json`. I wonder if instead, it should create a file called `OneZoom_2024-04-30_HH-MM-all.json` and a symlink called `OneZoom_latest-all.json`. That way we don't overwrite the older filtered JSON file.
Alternatively, if we find an outdated `OneZoom_latest-all.json`, we could move the older one to an `_oldversion-YYYY-MM-DD` version.
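Roughly, I'm imagining something like the following for the symlink idea (a minimal Python sketch; `write_filtered_dump` and the output handling are my invention, not the actual `generate_filtered_files` code):

```python
# Sketch only: the function name and output handling are assumptions,
# not the real generate_filtered_files interface.
import os
from datetime import datetime, timezone

def write_filtered_dump(output_dir, contents):
    """Write a timestamped filtered dump and point a 'latest' symlink at it."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M")
    dated_name = f"OneZoom_{stamp}-all.json"
    with open(os.path.join(output_dir, dated_name), "w") as f:
        f.write(contents)
    # Swap the symlink atomically: create it under a temp name, then rename
    # it over the old link, so readers never see a missing 'latest' file.
    link_path = os.path.join(output_dir, "OneZoom_latest-all.json")
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(dated_name, tmp_link)
    os.replace(tmp_link, link_path)
```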
Yes, it may make sense to rename the old one. We know what date/time to give it, because we base its file timestamp on the unfiltered file's.
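If we take that route, the rename could read the date straight off the existing file. A rough sketch (the path and the `_oldversion` naming scheme are taken from the suggestion above, not from existing code):

```python
# Sketch only: assumes the filtered file's mtime mirrors the unfiltered
# dump's, as described above; the naming scheme is hypothetical.
import os
from datetime import datetime, timezone

def archive_stale_latest(latest_path="OneZoom_latest-all.json"):
    """Rename an outdated filtered dump to an _oldversion-YYYY-MM-DD name."""
    if not os.path.exists(latest_path):
        return None
    mtime = datetime.fromtimestamp(os.path.getmtime(latest_path), tz=timezone.utc)
    stem, ext = os.path.splitext(latest_path)
    archived = f"{stem}_oldversion-{mtime:%Y-%m-%d}{ext}"
    os.rename(latest_path, archived)
    return archived
```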
It might be useful to know how many taxa have an image change when comparing an older (filtered?) wikidata JSON dump to a new one. This would give us some idea of the "normal" amount of extra harvesting that we would need to do when we grab a new JSON dump.
I'm assuming there might be a few routines in the wikidata harvesting script that could help with this, @davidebbo, although usually the code compares URLs in the new dump against what we have in the database, rather than what's in an existing file. So perhaps writing a comparison script like this isn't worth it?
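If it does turn out to be worth it, the core comparison is fairly small. A rough sketch, assuming the filtered dumps keep wikidata's one-entity-per-line JSON layout and that images live in P18 claims (both assumptions would need checking against the `generate_filtered_files` output):

```python
# Counts taxa whose P18 (image) claim differs between two filtered dumps.
# The one-entity-per-line format is an assumption, not a confirmed fact.
import json

def image_of(entity):
    """Return the first P18 (image) value of a wikidata entity, or None."""
    for claim in entity.get("claims", {}).get("P18", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if value:
            return value
    return None

def load_images(path):
    """Map QID -> image filename for every entity in a filtered dump."""
    images = {}
    with open(path) as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue  # skip the array brackets of a wikidata-style dump
            entity = json.loads(line)
            images[entity["id"]] = image_of(entity)
    return images

def count_image_changes(old_path, new_path):
    """Number of QIDs present in both dumps whose image value changed."""
    old, new = load_images(old_path), load_images(new_path)
    return sum(1 for qid, img in new.items() if qid in old and old[qid] != img)
```

Running `count_image_changes` on last month's dump against today's would give a rough baseline for the "normal" amount of re-harvesting per new dump.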