Open timrobertson100 opened 5 months ago
Alternatively, we could create a new RabbitMQ message and a flushing listener, and emit these messages in the CLB CLIs, configured with a list of dataset keys to be flushed.
If we take the message approach, I'd suggest the subscriber decide whether the cache should be flushed, since it's the pipeline, not CLB, that decides whether the short-circuit IDs are used.
Knowing we're moving off this edition of CLB and it'll all change soon, I wonder if it'd be sufficient to simply flush the cache daily, given that the lookup is now fast and rebuilding is not a big cost. What do you think?
Other options:

1. Add an extra flag when we process a dataset: ignore/repopulate the cache.
2. Add a lifetime (TTL) for records in the cache.

There are obvious pros and cons, and questions about data consistency.
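Option 2 maps directly onto HBase's built-in column-family TTL, which expires cells automatically without any new messaging. A minimal sketch, assuming a hypothetical cache table `name_usage_cache` with column family `v` (both names are illustrative, not the actual schema):

```shell
# Set a 24-hour TTL (in seconds) on the cache's column family.
# Expired cells are removed during compactions; no truncate needed.
# Table and family names here are hypothetical.
echo "alter 'name_usage_cache', {NAME => 'v', TTL => 86400}" | hbase shell -n
```

The trade-off is that a TTL expires entries gradually rather than atomically, so a stale and a fresh record can coexist briefly, which ties into the data-consistency question above.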
I wonder if we wouldn't face the same problems in the future too. We will still integrate COL, IUCN, WoRMS, and maybe more checklists supplying identifiers, just through a different system. But we'll be doing this more regularly, on a monthly basis, so that might simply be the time to flush and start from scratch, without the need to be informed about changed lists?
Actually, we already have the messaging in place: the CLB CLIs emit a ChecklistSyncedMessage once done. The only thing needed is a listener that knows which datasets to watch for and then flushes the cache.
(Not a pipeline specific issue, but tracking in this repo as it's connected to pipeline function)
Now that the name lookup cache holds decorated lookup records (e.g. IUCN) and supports the scientificNameID etc. short-circuit lookup, we need a way to flush the cache. It can go stale on changes in the IUCN Red List, the backbone, or any checklist configured for short-circuit ID lookup (e.g. WoRMS LSIDs).
portal-feedback/#5239 is an example of cached responses causing confusion.
I suggest we simply `truncate_preserve` the HBase table weekly (edit: or daily) after verifying that nothing is running.
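The suggested flush could be sketched as a daily cron-driven script. The table name and the "nothing is running" guard below are hypothetical placeholders, not actual paths in this repo:

```shell
#!/usr/bin/env bash
# Hypothetical daily cache-flush script; names are illustrative.
set -euo pipefail

# Site-specific guard: abort if any ingestion/pipeline job is running.
# (Placeholder — replace with whatever check the deployment uses.)
/opt/gbif/check-nothing-running.sh

# Truncate the cache table. truncate_preserve keeps the existing
# region splits, so the table doesn't collapse back to a single region.
echo "truncate_preserve 'name_usage_cache'" | hbase shell -n
```

Using `truncate_preserve` rather than plain `truncate` avoids a hotspot while the cache repopulates, since the region boundaries survive the flush.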