codeforIATI / iati-data-dump

📷 A daily snapshot of all IATI data on the IATI Registry
https://iati-data-dump.codeforiati.org
GNU General Public License v3.0

Dump contains old files that should not exist/should be replaced. #7

Closed sylvanr closed 1 year ago

sylvanr commented 1 year ago

Describe the bug

The iati-data-main.zip file contains datasets that are no longer available.

Here is a prime example: FCDO Zimbabwe is currently unavailable. However, if you download the latest iati-data-main.zip, the original .xml file with its data is still in there. It does show up in the error log as not available.

This means that the data still enters our processing pipeline while not actually available.

Oddly enough, the validation status on the registry (registry FCDO ZW link) has not changed from "error" to "not found", and the metadata seemingly has not been updated to reflect the new file content.

Still, it does not seem like it should be possible for the data dump to contain the old files. What if a publisher needs to redact its files, as we saw with the Afghanistan situation?

Hopefully we can quickly resolve this! Happy to chat about it.

sylvanr commented 1 year ago

In case the example dataset is made private, here are two screenshots.

The live dataset: Screenshot from 2023-02-09 11-49-36

The downloaded dataset from the most recent dump: Screenshot from 2023-02-09 11-50-42

andylolz commented 1 year ago

Thanks for raising this @sylvanr.

IATI Data Dump works this way by design.

It works this way because datasets are sometimes inaccessible when crawled. This can happen for various reasons, including server errors. So rather than all of a dataset’s activities dropping from the dump due to a server flicker, we attempt to cache data where it appears the publisher did not intend to remove it.

If a publisher does need to redact data, IATI provides the following guidance: https://iatistandard.org/en/data-removal/

If this guidance is followed, data will be automatically removed from IATI data dump. Additionally, if individual activities (or all activities) are removed from a dataset, they’ll be automatically removed from IATI data dump.

Changing the server response code for a dataset is not an approach recommended by IATI, and it won’t cause data to be removed from IATI data dump.
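As a rough illustration only (this is not the actual IATI Data Dump code), the behaviour described above could be sketched as follows. The function name `update_cache`, the in-memory dictionary cache, and the dataset ID `fcdo-zw` are all hypothetical, and the sketch simplifies "the publisher did not intend to remove it" to "the fetch did not succeed".

```python
from typing import Dict, Optional


def update_cache(
    cache: Dict[str, bytes],
    dataset_id: str,
    status_code: Optional[int],
    body: Optional[bytes],
) -> Dict[str, bytes]:
    """Return a new cache after one crawl attempt for one dataset.

    Hypothetical policy, not the project's real implementation:
    - successful fetch (2xx with a body): store the fresh copy, even if
      the publisher has emptied the file of activities
    - anything else (server error, 404, timeout): keep the last seen copy
    """
    updated = dict(cache)  # copy, so the caller's cache is not mutated
    fetched_ok = (
        status_code is not None
        and 200 <= status_code < 300
        and body is not None
    )
    if fetched_ok:
        updated[dataset_id] = body
    return updated


# "fcdo-zw" is a made-up dataset ID used purely for illustration.
cache = {"fcdo-zw": b"<iati-activities>old</iati-activities>"}
cache = update_cache(cache, "fcdo-zw", 404, None)  # failed crawl: old copy kept
cache = update_cache(cache, "fcdo-zw", 200, b"<iati-activities/>")  # replaced
```

Under this sketch, returning a 404 leaves the cached file in place, whereas publishing an emptied file replaces it, which matches the two outcomes described above.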

sylvanr commented 1 year ago

Thanks for the rapid response, @andylolz! This totally makes sense. I will pass this on to the publisher in this case, and keep it in mind for the future!

andylolz commented 1 year ago

In case it’s helpful: if you do want to remove a dataset in a way that goes beyond the IATI data removal guidance, you could check the error log (as you have done here) and remove it from the dump depending on the server response.
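A minimal sketch of that kind of downstream filtering is shown below. It assumes the error log has already been parsed into a mapping from dataset name to the last HTTP status code seen; the real log format is not described in this thread, so that mapping, the function name `datasets_to_keep`, and the choice of status codes are all assumptions.

```python
from typing import Dict, FrozenSet, Iterable, List

# Assumption: these codes are read as "the publisher took the file down".
REMOVAL_STATUS_CODES: FrozenSet[int] = frozenset({404, 410})


def datasets_to_keep(
    dump_datasets: Iterable[str],
    error_log: Dict[str, int],
    removal_codes: FrozenSet[int] = REMOVAL_STATUS_CODES,
) -> List[str]:
    """Drop datasets whose last crawl returned a removal-type status code.

    Datasets missing from the error log, or logged with another code
    (for example a 500 server error), are kept.
    """
    return [
        name
        for name in dump_datasets
        if error_log.get(name) not in removal_codes
    ]


# Hypothetical dataset names and log entries, for illustration only.
kept = datasets_to_keep(
    ["dataset-a", "dataset-b", "dataset-c"],
    {"dataset-a": 404, "dataset-b": 500},
)
```

Treating only 404 and 410 as removals keeps the protection against temporary server errors, but which codes to act on is a judgment call for each data user.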

sylvanr commented 1 year ago

Thanks! That is helpful, yes, and definitely something to consider. But it is also good to make sure that everything is used as intended on the publisher's end.

xriss commented 1 year ago

FYI

D-Portal also uses last seen data in the case of server errors.

However, currently it is ignoring all the data in this file as it is full of duplicate activities, which is presumably why it needs deleting.

I thought something strange was going on until I realised it was all duplicates.
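For anyone wanting to check a file for this themselves, here is a small sketch that counts repeated `iati-identifier` values within one IATI activity file. It assumes "duplicate" means two `iati-activity` elements sharing the same identifier; how D-Portal actually detects duplicates is not stated here, and the function name and sample data are made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict


def duplicate_identifiers(xml_text: str) -> Dict[str, int]:
    """Map each repeated iati-identifier to the number of times it occurs."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        (activity.findtext("iati-identifier") or "").strip()
        for activity in root.iter("iati-activity")
    )
    # Ignore activities with a missing or empty identifier.
    return {ident: n for ident, n in counts.items() if ident and n > 1}


# Made-up sample file with one identifier used twice.
SAMPLE = """<iati-activities>
  <iati-activity><iati-identifier>XX-1</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XX-1</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XX-2</iati-identifier></iati-activity>
</iati-activities>"""

duplicates = duplicate_identifiers(SAMPLE)
```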

andylolz commented 1 year ago

Aha – thanks for clarifying that, @xriss. I had thought d-portal worked the same way, but when I checked, I could see the logs for the dataset yet couldn’t find any activities attributed to it. This explains why!