gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Automate a flush of the name lookup cache #1059

Open timrobertson100 opened 5 months ago

timrobertson100 commented 5 months ago

(Not a pipeline-specific issue, but tracked in this repo as it's connected to how the pipelines function.)

Now that the name lookup cache holds decorated records of lookups (e.g. IUCN) and has the short-circuit lookup on scientificNameID etc., we need to flush the cache. It can go stale whenever the IUCN Red List, the backbone, or any checklist configured to short-circuit with ID lookup (e.g. WoRMS LSIDs) changes.

portal-feedback/#5239 is an example of cached responses causing confusion.

I suggest simply running truncate_preserve on the HBase table weekly (edit: or daily) after verifying that nothing is running.
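A minimal sketch of the "verify nothing is running, then truncate" step, for illustration only. The function name and both callables are hypothetical; in practice the check would query whatever tracks running pipelines, and the truncate step would issue the HBase shell `truncate_preserve` command (which recreates the table while keeping its region boundaries).

```python
from typing import Callable


def flush_if_idle(is_anything_running: Callable[[], bool],
                  truncate_cache: Callable[[], None]) -> bool:
    """Truncate the cache only when nothing is using it.

    Both arguments are hypothetical hooks supplied by the caller.
    Returns True if the cache was flushed, False if the flush was skipped.
    """
    if is_anything_running():
        # Skip this run; the scheduler (e.g. a daily cron job) will retry later.
        return False
    truncate_cache()
    return True


if __name__ == "__main__":
    # Stand-ins for the real checks, purely for demonstration.
    flushed = flush_if_idle(lambda: False, lambda: print("cache truncated"))
    print("flushed:", flushed)
```

Passing the check and the truncate action in as callables keeps the scheduling logic testable without a live cluster.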

mdoering commented 5 months ago

Alternatively, we could create a new RabbitMQ message and a flushing listener, and emit these messages from the CLB CLIs, to be configured with a list of dataset keys to be flushed.

timrobertson100 commented 5 months ago

If we do the message approach, I'd suggest the subscriber decide if the cache should be flushed, as it's the pipeline that decides if the short circuit IDs should be used, not CLB.

Since we're moving off this edition of CLB and it'll all change soon, I wonder if it'd be sufficient to simply flush the cache daily. The lookup is now fast, so rebuilding is not a big cost. What do you think?

muttcg commented 5 months ago

Other options:
1) Add an extra flag when we process a dataset: ignore/repopulate the cache
2) Add a lifetime for each record in the cache
There are obvious pros and cons, and questions about data consistency.
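The second option (a lifetime per cached record) could look roughly like the sketch below. This is an in-memory illustration under stated assumptions, not the existing cache: the `TtlCache` class and its methods are hypothetical, and the clock is injectable only so the expiry can be exercised without waiting. HBase column families also support a TTL attribute, which may make application-level expiry unnecessary.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Hypothetical cache where each record expires ttl_seconds after it was stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def put(self, key: str, value: object) -> None:
        # Remember when the record was stored so its age can be checked on read.
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[object]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            # Record is stale: drop it so the caller performs a fresh lookup.
            del self._entries[key]
            return None
        return value
```

A miss on an expired record simply forces a fresh lookup, so stale IUCN or checklist decorations age out on their own. The trade-off is that records written at different times can briefly disagree, which is the consistency question raised above.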

mdoering commented 5 months ago

I wonder if we would not face the same problems in the future too. We will still integrate COL, IUCN, WoRMS and maybe more checklists supplying identifiers, just through a different system. But we'll be doing this more regularly, on a monthly basis, so that might simply be the time to flush and start from scratch without the need to inform about changed lists?

Actually we have the messaging already in place - clb clis emit a ChecklistSyncedMessage once done. The only thing needed would be a listener that knows which datasets to watch out for and then flush the cache.
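As a rough sketch of such a listener: it holds the set of dataset keys to watch and flushes only when a synced message refers to one of them. Everything here is an assumption for illustration. The `ChecklistSynced` stand-in and its `dataset_key` field are hypothetical (the real ChecklistSyncedMessage may carry different fields), and the `flush` callable and the example key are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ChecklistSynced:
    """Hypothetical stand-in for ChecklistSyncedMessage; the field name is assumed."""
    dataset_key: str


class CacheFlushListener:
    """Flushes the name lookup cache when a watched checklist has been synced."""

    def __init__(self, watched_dataset_keys: Iterable[str],
                 flush: Callable[[], None]) -> None:
        self._watched = frozenset(watched_dataset_keys)
        self._flush = flush  # hypothetical hook that truncates the cache

    def handle(self, message: ChecklistSynced) -> bool:
        # The subscriber decides: only checklists it watches trigger a flush.
        if message.dataset_key not in self._watched:
            return False
        self._flush()
        return True


if __name__ == "__main__":
    listener = CacheFlushListener({"example-key"}, lambda: print("cache flushed"))
    listener.handle(ChecklistSynced("example-key"))
```

Keeping the watch list on the subscriber side matches the earlier point that the pipeline, not CLB, decides which checklists are used for short-circuit ID lookups.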