IATI / D-Portal

http://d-portal.org/

UNIDO data not refreshed in d-portal #644

Open sarahshamiso opened 1 year ago

sarahshamiso commented 1 year ago

I was checking some of UNIDO's data in d-portal to see whether they had recorded any non-OECD-DAC finance or flow types, and found that 2021 and 2022 expenditures are not yet showing up in d-portal, even though UNIDO's data was updated on 1 September (https://iatiregistry.org/publisher/unido).

This is the activity: http://d-portal.org/ctrack.html#view=act&aid=XM-DAC-41123-PROJECT-200326

The expenditures are in the XML (and also in the CDFD spreadsheets) -- https://www.unido.org/sites/default/files/iati/2022-unido-activities.xml; https://www.unido.org/sites/default/files/iati/2021-unido-activities.xml

I would expect d-portal to refresh within 24 hours of data being updated in the Registry, so I am wondering if you can look into this issue.

xriss commented 1 year ago

I checked the logs and the id XM-DAC-41123-PROJECT-200326 is used multiple times.

When this happens we pick one of them (the first file alphabetically by iati registry slug) and ignore the rest.

Here are all the datasets containing this ID:

https://iatiregistry.org/dataset/unido-activity-2020
https://iatiregistry.org/dataset/unido-activity-2021
https://iatiregistry.org/dataset/unido-activity-2022
https://iatiregistry.org/dataset/unido-activity-2023
https://iatiregistry.org/dataset/unido-activity-2024

UNIDO need to change how they publish this activity before it will work.
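The selection rule described above (keep the copy from the first file alphabetically by registry slug, ignore the rest) can be sketched roughly as follows. This is an illustration only, not d-portal's actual import code; the function name `pick_activities` and the input shape (a dict mapping slug to a list of identifiers) are assumptions made for the example.

```python
from collections import defaultdict


def pick_activities(datasets):
    """Hypothetical helper: datasets maps registry slug -> list of IATI identifiers.

    Returns (chosen, ignored): chosen maps each identifier to the slug whose copy
    is kept; ignored maps each re-used identifier to the slugs that were skipped.
    """
    chosen = {}
    ignored = defaultdict(list)
    # Walk the slugs in alphabetical order so the first slug always wins.
    for slug in sorted(datasets):
        for aid in datasets[slug]:
            if aid in chosen:
                ignored[aid].append(slug)
            else:
                chosen[aid] = slug
    return chosen, dict(ignored)


# Sample input based on three of the dataset slugs listed in this comment.
datasets = {
    "unido-activity-2022": ["XM-DAC-41123-PROJECT-200326"],
    "unido-activity-2020": ["XM-DAC-41123-PROJECT-200326"],
    "unido-activity-2021": ["XM-DAC-41123-PROJECT-200326"],
}
chosen, ignored = pick_activities(datasets)
print(chosen)
print(ignored)
```

Under this rule the copy in unido-activity-2020 is the one kept, which would explain why expenditures that only appear in the 2021 and 2022 files do not show up.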

sarahshamiso commented 1 year ago

Thanks so much for your quick reply on this and sorry it has taken me so long to follow up.

As far as I understand, UNIDO is not breaking any IATI rules by doing this -- i.e. we are allowing them to do this, so it seems we should be able to display their data.

I've discovered another example of this today in looking at a major data quality issue in the British Red Cross data (https://github.com/codeforIATI/iati-data-bugtracker/issues/50). It seems that d-portal is picking up the first instance of an activity in a file that contains their historical data, meaning the data in the same activity in the file that contains their updated data is not picked up. Example activity is here: http://d-portal.org/ctrack.html#view=act&aid=GB-CHC-220949-P7803

I'm also wondering how d-portal is pulling in the data from the historical file when there seems to be an issue with it -- https://iatiregistry.org/publisher/gb-chc-220949

It seems we likely need a better solution for this as I suspect that this issue impacts the data of more than these 2 publishers.

markbrough commented 1 year ago

Each IATI Identifier should be globally unique, so you shouldn't use the same IATI Identifier multiple times: https://iatistandard.org/en/guidance/standard-overview/preparing-your-data/activity-information/creating-iati-identifiers/

However, there appear to be 89 publishers that have re-used identifiers: https://analytics.codeforiati.org/identifiers.html

Perhaps we can think of a way of reaching out to these organisations and encouraging them to fix their data?

I think in some cases there are activities that cover multiple countries, so the publisher has placed the same activity in multiple country files. Instead, the activity should appear only once.

sarahshamiso commented 1 year ago

Yeah, I think the guidance on this (at least on this page) isn't abundantly clear, and the alternative, which is to use a new identifier (e.g. one identifier for spending in 2021 and a different one for spending in 2022), creates an even bigger mess for data users trying to make sense of the data. It seems to me that if this is a rule in the Standard, publishers should be discouraged from publishing using this approach. Unfortunately, since this has been allowed, it's quite a hard thing to fix, as it requires a completely different approach to publishing. I agree it would be good to reach out to publishers on this issue, but it seems a set of solutions would need to be considered and recommended to them. Tagging @DaveEade49 and @akmiller01

CDFD is able to make sense of this, so it seems there could be at least some solution for d-portal, although it would be much more complicated to implement.

akmiller01 commented 1 year ago

Handling duplicate identifiers would probably be more difficult for d-portal as compared to the CDFD. CDFD is primarily focused on financial data, but d-portal renders every single element in the standard, and there are many elements that cannot be logically merged together in a single-activity display page. D-portal also uses an activity's identifier as a part of the URL, and absent another unique way to identify activities, it's not possible to distinguish between activities with duplicate identifiers.

This might be a good place to use the hierarchy attribute of the iati-activity element, if the publishers want to create unique identifiers but also group them together under one parent activity. For example, they could create one XM-DAC-41123-PROJECT-200326 parent activity with hierarchy="1" and no attached financial information. They could then create multiple child activities with unique identifiers formed by suffixing the year (e.g. XM-DAC-41123-PROJECT-200326-2020, XM-DAC-41123-PROJECT-200326-2021, XM-DAC-41123-PROJECT-200326-2022, etc.). Each child would have hierarchy="2" and a related-activity element with type="1" (for "Parent") and ref="XM-DAC-41123-PROJECT-200326", and would carry the yearly financial data.
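To make the parent/child layout concrete, here is a minimal sketch that generates the skeleton with Python's standard library. It only emits the pieces mentioned above (hierarchy, iati-identifier, related-activity); a real IATI file needs many more elements (reporting-org, title, dates, transactions and so on), and the helper name `build_activities` is made up for this example.

```python
import xml.etree.ElementTree as ET

PARENT_ID = "XM-DAC-41123-PROJECT-200326"


def build_activities(parent_id, years):
    """Hypothetical helper: build one parent activity plus one child per year."""
    root = ET.Element("iati-activities")

    # Parent: hierarchy="1", no financial information attached.
    parent = ET.SubElement(root, "iati-activity", {"hierarchy": "1"})
    ET.SubElement(parent, "iati-identifier").text = parent_id

    for year in years:
        # Child: hierarchy="2", identifier made unique by suffixing the year.
        child = ET.SubElement(root, "iati-activity", {"hierarchy": "2"})
        ET.SubElement(child, "iati-identifier").text = f"{parent_id}-{year}"
        # type="1" marks the referenced activity as the parent.
        ET.SubElement(child, "related-activity", {"ref": parent_id, "type": "1"})
        # Yearly transactions and budgets would be added to the child here.

    return root


root = build_activities(PARENT_ID, [2020, 2021, 2022])
print(ET.tostring(root, encoding="unicode"))
```

Each identifier then appears exactly once across the published files, so a tool that keys on the identifier (as d-portal does in its URLs) has nothing to de-duplicate.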