OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images
MIT License

Check for updated images in wikidata JSON #64

Open hyanwong opened 4 days ago

hyanwong commented 4 days ago

It might be useful to know how many taxa have an image change when comparing an older (filtered?) wikidata JSON dump to a new one. This would give us some idea of the "normal" amount of extra harvesting that we would need to do when we grab a new JSON dump.

I'm assuming there might be a few routines in the wikidata harvesting script that could help with this, @davidebbo, although usually the code compares URLs in the new dump against what we have in the database, rather than what's in an existing file. So perhaps writing a comparison script like this isn't worth it?

davidebbo commented 4 days ago

It would probably be quite easy to write a new script that compares two filtered dumps. I can look into that next week.

Side note: make sure you pull the latest before creating a filtered dump, as I made recent changes to include a few more things that we now need.
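A minimal sketch of the comparison discussed above, not the project's actual script. It assumes the filtered dumps keep the standard Wikidata layout (one entity per line inside a JSON array, with image claims under property `P18` at `claims -> P18 -> mainsnak -> datavalue -> value`); if the filtering step strips or reshapes those fields, the lookup would need adjusting. The function names `iter_entities`, `image_map` and `changed_image_ids` are hypothetical.

```python
import json


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata-style JSON dump."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        yield json.loads(line)


def image_map(lines, prop="P18"):
    """Map each entity id to the set of image names in its `prop` claims."""
    images = {}
    for entity in iter_entities(lines):
        names = set()
        for claim in entity.get("claims", {}).get(prop, []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, str):
                names.add(value)
        images[entity["id"]] = names
    return images


def changed_image_ids(old_lines, new_lines):
    """Return sorted ids whose image set differs between the two dumps.

    An entity missing from one dump is treated as having no images there.
    """
    old, new = image_map(old_lines), image_map(new_lines)
    all_ids = old.keys() | new.keys()
    return sorted(q for q in all_ids if old.get(q, set()) != new.get(q, set()))
```

Both arguments can be open file handles, so `len(changed_image_ids(open(old_path), open(new_path)))` would give the count of items needing re-harvesting. The sketch holds one id-to-images map per dump in memory, which may need rethinking for very large files.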

hyanwong commented 4 days ago

It would probably be quite easy to write a new script that compares two filtered dumps. I can look into that next week.

Thanks. At the moment the generate_filtered_files script creates a file called e.g. OneZoom_latest-all.json. I wonder if instead, it should create a file called OneZoom_2024-04-30_HH-MM-all.json and a symlink called OneZoom_latest-all.json. That way we don't overwrite the older filtered JSON file.

Alternatively if we find an outdated OneZoom_latest-all.json, we could move the older one to an _oldversion-YYYY-MM-DD version.
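The first option above (a dated file plus a `OneZoom_latest-all.json` symlink) could be sketched roughly as follows. This is illustrative only and not taken from generate_filtered_files: the helper names `dated_name` and `point_latest_at` are hypothetical, the dated file and the link are assumed to live in the same directory, and symlink creation may need extra privileges on Windows.

```python
import os
from datetime import datetime


def dated_name(timestamp, prefix="OneZoom_", suffix="-all.json"):
    """Build a sortable file name such as OneZoom_2024-04-30_09-05-all.json."""
    return f"{prefix}{timestamp:%Y-%m-%d_%H-%M}{suffix}"


def point_latest_at(target, link_path):
    """Make link_path a symlink to target, replacing whatever was there.

    The link is created under a temporary name and then renamed over
    link_path, so readers never see a missing "latest" file.
    """
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    # Relative link: assumes target sits in the same directory as link_path.
    os.symlink(os.path.basename(target), tmp)
    os.replace(tmp, link_path)
```

Note that `os.replace` would also overwrite an existing regular file called `OneZoom_latest-all.json`, so a one-off migration of the current file to a dated name would be needed first.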

davidebbo commented 2 days ago

Yes, it may make sense to rename the old one. We know what date/time to give it because the filtered file's timestamp is set from the unfiltered file's.
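A rough sketch of that rename, assuming the old file's modification time is the date to use (as described above) and assuming the `_oldversion-YYYY-MM-DD` suffix goes just before the extension; the exact naming was not settled in this thread, and `archive_old_version` is a hypothetical helper.

```python
import os
from datetime import datetime


def archive_old_version(path):
    """Rename path to <root>_oldversion-YYYY-MM-DD<ext> using its mtime.

    Returns the new path, or None if there was nothing to archive.
    """
    if not os.path.exists(path):
        return None
    # The date comes from the file's own modification time (local time).
    stamp = datetime.fromtimestamp(os.path.getmtime(path)).strftime("%Y-%m-%d")
    root, ext = os.path.splitext(path)
    archived = f"{root}_oldversion-{stamp}{ext}"
    # os.replace overwrites an existing archive with the same date.
    os.replace(path, archived)
    return archived
```

Calling this on `OneZoom_latest-all.json` before writing a new filtered file would leave, for example, `OneZoom_latest-all_oldversion-2024-04-30.json` next to it.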