IATI / refresher

A Python application responsible for tracking IATI data from around the Web and refreshing the core IATI software's data stores
GNU Affero General Public License v3.0

When the same activity ID is in multiple files there are multiple problems #300

Closed odscjames closed 8 months ago

odscjames commented 10 months ago

Brief Description: The same activity ID is in 2 different activities in 2 different files - many problems!

Severity: High

Details:

1) If the same activity ID is in 2 different activities in different files, 2 records will appear in the Solr core. This happens because the Solr ID of a record is calculated by hashing the whole Solr record. Even if the 2 activities are exactly the same, the iati_activities_document_id field will be different, so they will get different Solr IDs.

2) The lakify file name is the hash of the activity ID. There is thus a race condition if the same activity ID is in 2 different activities in different files AND they are processed at the same time: both lakify stages may run, and the results of the second one will overwrite the results of the first one. There will still be 2 different Solr records (the iati_activities_document_id field will be different, AND the results of the Flatten stage are stored in psql and so aren't affected by this bug, SO the 2 Solr records will differ and still get different IDs), but the data in the iati_xml and iati_json fields may not match the other flattened data for one of the records. (Both effects are sketched below.)

[I think - worked out via code reading, not testing]
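
A minimal sketch of both effects, based purely on the description above (the hash function, field names and filename scheme here are stand-ins, not the refresher's actual code):

```python
import hashlib
import json

def solr_record_id(record: dict) -> str:
    # Effect 1: hypothetical stand-in for the Solr ID - a hash of the whole record.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def lakify_filename(activity_id: str) -> str:
    # Effect 2: hypothetical stand-in for the lakify filename - a hash of the
    # activity ID alone, so the source document plays no part in the name.
    return hashlib.sha256(activity_id.encode("utf-8")).hexdigest() + ".json"

# The same activity ID arrives in 2 different activities in 2 different files.
record_a = {"iati_identifier": "XX-EXAMPLE-1", "iati_activities_document_id": "doc-1"}
record_b = {"iati_identifier": "XX-EXAMPLE-1", "iati_activities_document_id": "doc-2"}

# Effect 1: the document IDs differ, so the hashes differ - 2 Solr records.
assert solr_record_id(record_a) != solr_record_id(record_b)

# Effect 2: both activities map to the same lakify file, so whichever lakify
# run finishes second silently overwrites the output of the first.
assert lakify_filename("XX-EXAMPLE-1") == lakify_filename("XX-EXAMPLE-1")
```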

Expected Results/Behaviour: We need to discuss what should happen here; this probably leads to a big piece of work. The guidance is that publishers shouldn't do this, but many do. The Validator should warn users. The Datastore should probably only have 1 record per activity ID, but it will need some way to detect this case and some methodology to pick which one to use. Also, think about how this affects the Transactions and Budgets cores.

robredpath commented 10 months ago

Thanks for this, @odscjames - what a situation!

My sense is that the first issue introduces friction into the process of consuming IATI data from the Datastore, but accurately reflects the state of the underlying data, and so isn't the worst thing. In saying that, I'm conscious that I'm moving responsibility for dealing with multiple <iati-activity> objects with the same iati-identifier from us to the data user, but that is a fairly straightforward caveat for us. IMO we should document it, offer support to mitigate it, and deal with it in due course. It's particularly gnarly because dealing with that merging is fundamentally an opinionated process, and it's my view that the Datastore should do as little opinionated processing as possible.

The second issue is....pretty bad! Given what I said above, I think we'd do well to find a way to resolve the race condition without needing to resolve the duplicate data problem yet. At least the data would be consistent - which is a good starting point.

Then - yes, we can give some thought to how we design a joined-up solution to this across the standard, tools, comms, support, etc.

odscjames commented 10 months ago

> My sense is that the first issue introduces friction into the process of consuming IATI data from the Datastore, .... I'm conscious that I'm moving responsibility .... from us to the data user,

Yeah, a good mitigating action would be to work out some clear advice and methodologies that we can give to data users so we don't just leave them in the dark here.

> The second issue is....pretty bad!

I suspect there might be an easy fix here, but I need to read the code to check - I think if we change the filenames of lakify data to "DOCUMENTID-ACTIVITYID", everything will just sort itself out with no data migration. I think if the Solr stage can't find the data (which it wouldn't for old data) it will send it back to be relakified anyway. And lakify files are tagged with metadata for clean up later, so old ones will be deleted eventually.
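
A sketch of that proposed scheme (hypothetical names; the point is only that the document ID becomes part of the filename, so 2 documents can no longer collide):

```python
import hashlib

def lakify_filename(document_id: str, activity_id: str) -> str:
    # Proposed "DOCUMENTID-ACTIVITYID" scheme: hash the activity ID so the name
    # stays filesystem-safe, but prefix the document ID so that the same
    # activity in different documents lands in different files.
    activity_hash = hashlib.sha256(activity_id.encode("utf-8")).hexdigest()
    return f"{document_id}-{activity_hash}.json"

# The same activity ID in 2 different documents now yields 2 distinct files.
assert lakify_filename("doc-1", "XX-EXAMPLE-1") != lakify_filename("doc-2", "XX-EXAMPLE-1")
```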

> Also, think about how this affects Transactions and Budgets core.

The more I think about it though, the more I want to double-check that.

odscjames commented 10 months ago

> Also, think about how this affects Transactions and Budgets core.

OK, so if the same activity ID is in 2 different activities in different files, and those activities both have Budgets/Transactions, then I think the Budgets/Transactions will just appear in the core twice.

The ID of the Solr record is based on hashing the data (and an idx, but that's not relevant here), but again, iati_activities_document_id will be different, so we'll get different hashes.

robredpath commented 10 months ago

Great - I think we already have a pattern for how we discuss this sort of thing publicly, through another bug we found - although we might expect a bit more engagement around this one because it affects activity querying.

And it sounds like the easy check is worth at least testing. Great stuff!

odscjames commented 10 months ago

I can confirm bug 2 locally:

Process the pipeline, but only for the publisher 'ifrcrcs' and the documents 'ifrcrcs-act202203' and 'ifrcrcs-act202107'. Process it one stage at a time - i.e. do all the Lakify stage, then do all the Solrize stage. When processed, query Solr with:
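
(The query itself has been lost from this thread. Something along these lines would surface the duplicates - the host, core name and activity ID here are placeholder assumptions, not values from the original comment:)

```python
import requests

# Hypothetical reconstruction of the lost query: fetch every Solr record for
# one activity ID so the payloads can be compared side by side.
response = requests.get(
    "http://localhost:8983/solr/transaction/select",
    params={
        "q": 'iati_identifier:"XM-EXAMPLE-ACTIVITY-1"',
        "fl": "iati_activities_document_id,iati_json",
    },
)
for doc in response.json()["response"]["docs"]:
    print(doc["iati_activities_document_id"])
```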

You'll see the Transaction values are different, but if you copy the JSON values into 2 files and diff them, they are exactly the same!

odscjames commented 10 months ago

Commit linked with fix!

To test, once you have done the stuff in the last comment:

Connect to the local SQL database and run:

```sql
UPDATE document
SET lakify_start = NULL, lakify_end = NULL, lakify_error = NULL,
    solrize_start = NULL, solrize_end = NULL, solr_api_error = NULL;
```

Run the Lakify and Solrize stages, one stage at a time - i.e. do all the Lakify stage, then do all the Solrize stage.

Rerun the above Solr query, and this time the 2 values of iati_json should be different!

(I did test just clearing the solr columns and seeing if the Solr stage would send documents back to the Lakify stage correctly. It did - but I realised this is a very bad way of releasing this fix, because the Solr stage deletes all records first, which means documents would not be available in the Solr indexes for a while!)

To release:

Still to test: documents are cleared from the container properly, by hash - but if we follow the release procedure in full, we won't need to worry.

odscjames commented 10 months ago

Thanks to @akmiller01 for review - he points out that when you look at a URL for an activity, it's looked up by activity ID ( https://datastore.iatistandard.org/activity/FR-6-CF-191 ) and it just shows one of them at random. This is related to bug 1 and https://github.com/IATI/datastore-search/blob/develop/src/components/ActivityResult.vue#L55

robredpath commented 10 months ago

Amazing work, @odscjames!

Do we need to notify users of this before release? It sounds like it will affect anyone who routinely uses data by these publishers.

odscjames commented 10 months ago

I think we should notify users of this at some point. However, the release of this fix is delayed until we get something else released, so it won't happen quickly.

odscjames commented 10 months ago

Here's a python script to run to clean up the container. https://gist.github.com/odscjames/9488e1b4e1c307915dbb567d690d1227

Shell into one of the containers, apt-get install wget, get the raw code and run it.
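
(For flavour only - the real script is the gist above. A guess at its shape, assuming the lake is an Azure Blob Storage container and that old-scheme lakify files are named with a bare hex hash; neither assumption is stated in this thread:)

```python
import re

from azure.storage.blob import ContainerClient

# Hypothetical sketch of the cleanup: delete lakify blobs still using the old
# hash-only naming scheme. Connection string, container name and the filename
# pattern are all assumptions.
OLD_SCHEME = re.compile(r"^[0-9a-f]{40,64}\.(json|xml)$")

container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="lake"
)
for blob in container.list_blobs():
    if OLD_SCHEME.match(blob.name):
        print(f"Deleting old-scheme lakify file: {blob.name}")
        container.delete_blob(blob.name)
```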

odscjames commented 8 months ago

I think we can close this now.

Issue 1 - there's a much bigger piece of work here (to supply guidance and help/tools to data users to merge activities) that applies across all our tools and that I've been wanting to do for ages, so we'll handle it elsewhere.

Issue 2 - fully released!