Open markbrough opened 3 years ago
I think you already mentioned elsewhere, but
I suspect most of the file size discrepancy can be attributed to the imf
custom namespace elements that IATI Datastore adds. E.g.:
<imf:usd_transaction xmlns:imf="https://www.imf.org/external/index.htm" currency="USD" exchange_rate="1.14189">
<imf:usd-value>1085236.38</imf:usd-value>
<imf:url>http://www.imf.org/external/np/fin/ert/GUI/Pages/Report.aspx?Type=XML&CU=%27EUR%27,%27JPY%27,%27GBP%27,%27USD%27,%27DZD%27,%27AUD%27,%27ATS%27,%27BHD%27,%27BEF%27,%27VEF%27,%27BWP%27,%27BRL%27,%27BND%27,%27CAD%27,%27CLP%27,%27CNY%27,%27COP%27,%27CYP%27,%27CZK%27,%27DKK%27,%27DEM%27,%27FIM%27,%27FRF%27,%27GRD%27,%27HUF%27,%27ISK%27,%27INR%27,%27IDR%27,%27IRR%27,%27IEP%27,%27ILS%27,%27ITL%27,%27KZT%27,%27KRW%27,%27EEK%27,%27KWD%27,%27LYD%27,%27LUF%27,%27MYR%27,%27MTL%27,%27MUR%27,%27MXN%27,%27NPR%27,%27NLG%27,%27NZD%27,%27NOK%27,%27PEN%27,%27PKR%27,%27UYU%27,%27PHP%27,%27PLN%27,%27PTE%27,%27QAR%27,%27OMR%27,%27RUB%27,%27SAR%27,%27SGD%27,%27SKK%27,%27SIT%27,%27ZAR%27,%27ESP%27,%27LKR%27,%27SEK%27,%27CHF%27,%27THB%27,%27TTD%27,%27TND%27,%27AED%27,%27VEB%27&EX=SDRC&P=DateRange&CF=UnCompressed&CUF=Period&DS=Ascending&DT=NA&Fr=636818976000000000&To=636844896000000000</imf:url>
</imf:usd_transaction>
I think these are added to every transaction and budget element.
Hmm.. I thought your hunch would probably be correct, @andylolz, but there might be something more mysterious going on. Below are some quick statistics for the same Nigeria file.
Test | IATI Datastore | Datastore Classic |
---|---|---|
Activities | 10260 | 10622 |
Transactions | 651768 | 90526 |
There are a small number of activities with a very large number of transactions - each of these 17 activities have over 5000 activities, and as discussed it appears they are missing from Datastore Classic:
['US-GOV-1-720201651382', 'US-GOV-1-720201851533', 'US-GOV-1-720202051401', 'US-GOV-1-720201851382', 'US-GOV-1-720201951382', 'US-GOV-1-720201951533', 'US-GOV-1-720201551574', 'US-GOV-1-720201751533', 'US-GOV-1-720202051552', 'US-GOV-1-720201451533', 'US-GOV-1-720201451382', 'US-GOV-1-720201551382', 'US-GOV-1-720201651533', 'US-GOV-1-720201551533', 'US-GOV-1-720201751382', 'US-GOV-1-720201351533', 'US-GOV-1-720201351382']
Activity US-GOV-1-720202051401
has 9508 transactions!
Haha, you found the jackpot @markbrough we spottedUS-GOV-1-720202051401
before :-)
Transactions 651768 90526
Yeap, lack of data. Nothing new right.. Another reason nobody should be using DSv1 in current shape for anything in a real scenario as data is simply not complete.
- Datastore Classic has 10,622 activities (including 670 version 1.0x activities)
Which means DSv1 actually has 10622-670=9952 activities currently vs 10260 on IATI Datastore.
Another reason nobody should be using DSv1 in current shape for anything in a real scenario as data is simply not complete.
In fact I don’t think this is a bug with DSv1 – I think it’s down to IATI Data Dump, which datastore classic is now set up to use.
The datasets with lots of transactions are all USAID ones, and a number of these datasets are currently failing to get fetched by IATI Data Dump. See the error logs: https://gist.githubusercontent.com/codeforIATIbot/f117c9be138aa94c9762d57affc51a64/raw/errors
I’m currently attempting to resolve this!
Okay, I think I’ve fixed the issue for USAID datasets. So tonight’s datastore import might be bit slower, and the transaction number should (hopefully) be a lot higher by tomorrow.
The datasets that failed to import all have a space in the URL, e.g.: https://s3.amazonaws.com/files.explorer.devtechlab.com/iati-activities-Multiple Countries-11.xml
This space exists in the registry metadata, but it needs to be percent encoded in order for the link to work. Browsers do this automatically, but curl (which IATI data dump uses) doesn’t. So that’s why these USAID datasets didn’t make it in.
With the missing USAID data now imported, this datastore classic query is up to 536.4MB (it’s now a bit weird that it’s so much larger…)
The numbers also look much more closely comparable with the IATI Datastore:
Test | IATI Datastore | Datastore Classic (v2.0x only) | Datastore Classic (all versions) |
---|---|---|---|
Activities | 10,260 | 10,709 | 11,400 |
Transactions | 651,768 | 653,048 | 655,775 |
So it looks as though those missing transactions have been accounted for.
I’ve made a start at comparing the number of activities in datastore classic with the number reported by iatikit. I think it might be possible to branch a few issues off from this, but I’ll just leave this here for now.
Note that datastore classic uses iatikit, so we’d expect these numbers to be similar. They won’t be exactly the same, because datastore classic does more work parsing activities, and could reject an activity because the data is invalid in some way.
At present, datastore classic reports 924,571 activities. iatikit reports 987,226 activities with unique identifiers. So that’s a difference of 62,655 activities. This could be due to problems with parsing, but it is quite a large discrepancy.
Here’s a table of the datasets where the discrepancy in activity count is largest:
Dataset | IATI Classic activity count | iatikit activity count | Explanation |
---|---|---|---|
bmz-activities_based_on_recipient_country | 0 | 7167 | Out of memory error |
cgiar-activities-grants-feb21 | 0 | 3327 | Dataset points to the same resource as cgiar-activities-grants |
unitedstates-998-part-1 | 156 | 3186 | |
unitedstates-998-part-10 | 304 | 3244 | |
jica-loans_2013 | 0 | 2767 | Duplicate IATI identifiers |
unitedstates-998-part-11 | 506 | 3244 | |
unitedstates-998-part-12 | 600 | 3244 | |
unitedstates-998-part-13 | 750 | 3246 | |
unitedstates-mx-part-1 | 816 | 3173 | |
unitedstates-co-part-1 | 620 | 2962 | |
unitedstates-998-part-14 | 977 | 3246 | |
unitedstates-gt-part-1 | 993 | 3164 | |
unitedstates-998-part-15 | 1114 | 3245 | |
unitedstates-998-part-16 | 1249 | 3246 | |
unitedstates-pe-part-1 | 903 | 2875 | |
unitedstates-998-part-17 | 1343 | 3246 | |
unitedstates-hn-part-1 | 956 | 2824 | |
sdc_ch-190311 | 1417 | 3224 | Duplicate IATI identifiers |
unitedstates-998-part-18 | 1498 | 3245 | |
unitedstates-mx-part-2 | 1532 | 3174 | |
unitedstates-998-part-19 | 1652 | 3245 | |
unitedstates-ua-part-1 | 1361 | 2952 | |
sdc_ch-170314 | 1501 | 3074 | Duplicate IATI identifiers |
jica-tc_2012 | 1221 | 2767 | Duplicate IATI identifiers |
unitedstates-998-part-2 | 1735 | 3242 | |
unitedstates-sv-part-1 | 1390 | 2846 | |
jica-789_2015 | 2113 | 3558 | Duplicate IATI identifiers |
afd-global | 1522 | 2933 | Duplicate IATI identifiers |
unitedstates-af-part-1 | 709 | 2088 | |
unitedstates-co-part-2 | 1590 | 2963 | |
unitedstates-pa-part-1 | 1297 | 2664 | |
sdc_ch-150410 | 1366 | 2687 | Duplicate IATI identifiers |
unitedstates-998-part-20 | 1963 | 3245 | |
sdc_ch-201207 | 1373 | 2651 | Duplicate IATI identifiers |
idrccrdi-act2 | 1268 | 2494 | Duplicate IATI identifiers |
unesco-43104 | 719 | 1932 | Duplicate IATI identifiers |
unitedstates-cr-part-1 | 1152 | 2335 | |
sdc_ch-161212 | 2581 | 3714 | Duplicate IATI identifiers |
unitedstates-pk-part-1 | 1244 | 2368 | |
sdc_ch-040324 | 2114 | 3198 | Duplicate IATI identifiers |
unitedstates-998-part-21 | 2188 | 3245 | |
unitedstates-gt-part-2 | 2189 | 3164 | |
unitedstates-ht-part-1 | 1431 | 2351 | |
unitedstates-998-part-22 | 2353 | 3245 | |
unitedstates-pe-part-2 | 1995 | 2874 | |
unitedstates-hn-part-2 | 1947 | 2826 | |
unitedstates-co-part-3 | 2110 | 2965 | |
unitedstates-998-part-3 | 2390 | 3243 | |
unitedstates-mx-part-3 | 2373 | 3173 | |
unitedstates-998-part-4 | 2508 | 3244 | |
unrwa-all | 134 | 798 | |
unitedstates-af-part-2 | 1426 | 2090 | |
jica-389_2014 | 0 | 660 | Duplicate IATI identifiers |
dmru-all | 0 | 631 | Points to the same resource as dmru-999 |
unitedstates-998-part-5 | 2644 | 3244 | |
jica-679_2015 | 1476 | 1971 | Duplicate IATI identifiers |
unitedstates-998-part-6 | 2767 | 3244 | |
jica-679_2014 | 1351 | 1825 | Duplicate IATI identifiers |
unitedstates-998-part-7 | 2869 | 3244 | |
unpf-2015 | 6049 | 6421 | Duplicate IATI identifiers |
pin-dataset-20200930 | 0 | 327 | Out of memory error |
ifrcrcs-act202101 | 8 | 327 | Duplicate IATI identifiers |
ifrcrcs-act202012 | 6 | 319 | Duplicate IATI identifiers |
be-dgd-2021_4 | 8 | 312 | |
pin-dataset-20200630 | 0 | 303 |
Here’s the code I used to generate that:
import iatikit
import requests
discrepancies = []
for d in iatikit.data().datasets:
success = False
while success is False:
try:
r = requests.get('https://datastore.codeforiati.org/api/1/about/dataset/' + d.name)
success = r.status_code == 200
j = r.json()
except:
pass
if j['resources']:
c1 = j['resources'][0]['num_of_activities']
else:
c1 = 0
c2 = d.activities.count()
if c1 != c2:
discrepancies.append((c1, c2, d.name))
discrepancies = sorted(discrepancies, key=lambda x: x[1] - x[0], reverse=True)
for x in discrepancies:
if x[1] - x[0] > 300:
print(x)
I’ve made a start at an "Explanation" column above.
Re-use of IATI identifiers appears to be a common theme:
Publisher | Number of duplicate IATI identifiers |
---|---|
JICA | 8,736 |
Switzerland - SDC | 8,231 |
IFRC | 6,085 |
AFD | 2,969 |
Canada – IDRC | 1,356 |
UNESCO | 1,213 |
UN Pooled Funds | 390 |
Right now, iatikit is reporting 989,421 activities with unique identifiers. Datastore classic is reporting 986,234 activities with unique identifiers. So a difference of just 3,187 activities.
It looks like there was some problem parsing lots of unitedstates datasets, which has now been resolved. The total activities for each of this publisher’s datasets now match up.
It’s a bit difficult to compare accurately, because datastore classic caches results from datasets that are not currently available, whereas IATI Data Dump doesn’t do that (though perhaps it should – there’s a ticket for that here codeforIATI/iati-data-dump#18). So, DS Classic will have some activities that are not in IATI data dump, and IATI data dump may have some activities that are not in DS Classic (e.g. bmz-activities_based_on_recipient_country, which appears to cause memory problems).
There’s also definitely a problem with reporting errors – some stuff ends up in the publicly accessible error logs (/api/1/error/dataset/DATASET/) but not everything. It would be good to make that a bit more consistent.
Spinning off from #173, it would be good to investigate reasons for differences in the total file size of XML data. E.g.: are there activities missing?
Datastore Classic - all activities with recipient-country Nigeria: https://datastore.codeforiati.org/api/1/access/activity.xml?recipient-country=NG&stream=True
IATI Datastore - all activities with recipient-country Nigeria: https://iatidatastore.iatistandard.org/search/activity?q=((recipient_country_code:NG)%20OR%20(transaction_recipient_country_code:NG))&wt=xslt&tr=activity-xml.xsl&rows=100000000