codeforIATI / iati-datastore

An open-source datastore for IATI data with RESTful web API providing XML, JSON, CSV plus ETL tools
https://datastore.codeforiati.org
Other
1 stars 1 forks source link

Review differences with other Datastores #185

Open markbrough opened 3 years ago

markbrough commented 3 years ago

Spinning off from #173, it would be good to investigate reasons for differences in the total file size of XML data. E.g.: are there activities missing?

Datastore Classic - all activities with recipient-country Nigeria: https://datastore.codeforiati.org/api/1/access/activity.xml?recipient-country=NG&stream=True

IATI Datastore - all activities with recipient-country Nigeria: https://iatidatastore.iatistandard.org/search/activity?q=((recipient_country_code:NG)%20OR%20(transaction_recipient_country_code:NG))&wt=xslt&tr=activity-xml.xsl&rows=100000000

Service Time Size
IATI Datastore 44s 319.44MB
Datastore Classic 24s 106.73MB
andylolz commented 3 years ago

I think you already mentioned elsewhere, but

andylolz commented 3 years ago

I suspect most of the file size discrepancy can be attributed to the imf custom namespace elements that IATI Datastore adds. E.g.:

<imf:usd_transaction xmlns:imf="https://www.imf.org/external/index.htm" currency="USD" exchange_rate="1.14189">
  <imf:usd-value>1085236.38</imf:usd-value>
  <imf:url>http://www.imf.org/external/np/fin/ert/GUI/Pages/Report.aspx?Type=XML&amp;CU=%27EUR%27,%27JPY%27,%27GBP%27,%27USD%27,%27DZD%27,%27AUD%27,%27ATS%27,%27BHD%27,%27BEF%27,%27VEF%27,%27BWP%27,%27BRL%27,%27BND%27,%27CAD%27,%27CLP%27,%27CNY%27,%27COP%27,%27CYP%27,%27CZK%27,%27DKK%27,%27DEM%27,%27FIM%27,%27FRF%27,%27GRD%27,%27HUF%27,%27ISK%27,%27INR%27,%27IDR%27,%27IRR%27,%27IEP%27,%27ILS%27,%27ITL%27,%27KZT%27,%27KRW%27,%27EEK%27,%27KWD%27,%27LYD%27,%27LUF%27,%27MYR%27,%27MTL%27,%27MUR%27,%27MXN%27,%27NPR%27,%27NLG%27,%27NZD%27,%27NOK%27,%27PEN%27,%27PKR%27,%27UYU%27,%27PHP%27,%27PLN%27,%27PTE%27,%27QAR%27,%27OMR%27,%27RUB%27,%27SAR%27,%27SGD%27,%27SKK%27,%27SIT%27,%27ZAR%27,%27ESP%27,%27LKR%27,%27SEK%27,%27CHF%27,%27THB%27,%27TTD%27,%27TND%27,%27AED%27,%27VEB%27&amp;EX=SDRC&amp;P=DateRange&amp;CF=UnCompressed&amp;CUF=Period&amp;DS=Ascending&amp;DT=NA&amp;Fr=636818976000000000&amp;To=636844896000000000</imf:url>
</imf:usd_transaction>

I think these are added to every transaction and budget element.

markbrough commented 3 years ago

Hmm.. I thought your hunch would probably be correct, @andylolz, but there might be something more mysterious going on. Below are some quick statistics for the same Nigeria file.

Test IATI Datastore Datastore Classic
Activities 10260 10622
Transactions 651768 90526
markbrough commented 3 years ago

There are a small number of activities with a very large number of transactions - each of these 17 activities have over 5000 activities, and as discussed it appears they are missing from Datastore Classic:

['US-GOV-1-720201651382', 'US-GOV-1-720201851533', 'US-GOV-1-720202051401', 'US-GOV-1-720201851382', 'US-GOV-1-720201951382', 'US-GOV-1-720201951533', 'US-GOV-1-720201551574', 'US-GOV-1-720201751533', 'US-GOV-1-720202051552', 'US-GOV-1-720201451533', 'US-GOV-1-720201451382', 'US-GOV-1-720201551382', 'US-GOV-1-720201651533', 'US-GOV-1-720201551533', 'US-GOV-1-720201751382', 'US-GOV-1-720201351533', 'US-GOV-1-720201351382']
markbrough commented 3 years ago

Activity US-GOV-1-720202051401 has 9508 transactions!

siemvaessen commented 3 years ago

Haha, you found the jackpot @markbrough we spottedUS-GOV-1-720202051401before :-)

siemvaessen commented 3 years ago

Transactions 651768 90526

Yeap, lack of data. Nothing new right.. Another reason nobody should be using DSv1 in current shape for anything in a real scenario as data is simply not complete.

siemvaessen commented 3 years ago
  • Datastore Classic has 10,622 activities (including 670 version 1.0x activities)

Which means DSv1 actually has 10622-670=9952 activities currently vs 10260 on IATI Datastore.

andylolz commented 3 years ago

Another reason nobody should be using DSv1 in current shape for anything in a real scenario as data is simply not complete.

In fact I don’t think this is a bug with DSv1 – I think it’s down to IATI Data Dump, which datastore classic is now set up to use.

The datasets with lots of transactions are all USAID ones, and a number of these datasets are currently failing to get fetched by IATI Data Dump. See the error logs: https://gist.githubusercontent.com/codeforIATIbot/f117c9be138aa94c9762d57affc51a64/raw/errors

I’m currently attempting to resolve this!

andylolz commented 3 years ago

Okay, I think I’ve fixed the issue for USAID datasets. So tonight’s datastore import might be bit slower, and the transaction number should (hopefully) be a lot higher by tomorrow.


The datasets that failed to import all have a space in the URL, e.g.: https://s3.amazonaws.com/files.explorer.devtechlab.com/iati-activities-Multiple Countries-11.xml

This space exists in the registry metadata, but it needs to be percent encoded in order for the link to work. Browsers do this automatically, but curl (which IATI data dump uses) doesn’t. So that’s why these USAID datasets didn’t make it in.

andylolz commented 3 years ago

With the missing USAID data now imported, this datastore classic query is up to 536.4MB (it’s now a bit weird that it’s so much larger…)

The numbers also look much more closely comparable with the IATI Datastore:

Test IATI Datastore Datastore Classic (v2.0x only) Datastore Classic (all versions)
Activities 10,260 10,709 11,400
Transactions 651,768 653,048 655,775

So it looks as though those missing transactions have been accounted for.

andylolz commented 3 years ago

I’ve made a start at comparing the number of activities in datastore classic with the number reported by iatikit. I think it might be possible to branch a few issues off from this, but I’ll just leave this here for now.

Note that datastore classic uses iatikit, so we’d expect these numbers to be similar. They won’t be exactly the same, because datastore classic does more work parsing activities, and could reject an activity because the data is invalid in some way.

At present, datastore classic reports 924,571 activities. iatikit reports 987,226 activities with unique identifiers. So that’s a difference of 62,655 activities. This could be due to problems with parsing, but it is quite a large discrepancy.

Here’s a table of the datasets where the discrepancy in activity count is largest:

Dataset IATI Classic activity count iatikit activity count Explanation
bmz-activities_based_on_recipient_country 0 7167 Out of memory error
cgiar-activities-grants-feb21 0 3327 Dataset points to the same resource as cgiar-activities-grants
unitedstates-998-part-1 156 3186
unitedstates-998-part-10 304 3244
jica-loans_2013 0 2767 Duplicate IATI identifiers
unitedstates-998-part-11 506 3244
unitedstates-998-part-12 600 3244
unitedstates-998-part-13 750 3246
unitedstates-mx-part-1 816 3173
unitedstates-co-part-1 620 2962
unitedstates-998-part-14 977 3246
unitedstates-gt-part-1 993 3164
unitedstates-998-part-15 1114 3245
unitedstates-998-part-16 1249 3246
unitedstates-pe-part-1 903 2875
unitedstates-998-part-17 1343 3246
unitedstates-hn-part-1 956 2824
sdc_ch-190311 1417 3224 Duplicate IATI identifiers
unitedstates-998-part-18 1498 3245
unitedstates-mx-part-2 1532 3174
unitedstates-998-part-19 1652 3245
unitedstates-ua-part-1 1361 2952
sdc_ch-170314 1501 3074 Duplicate IATI identifiers
jica-tc_2012 1221 2767 Duplicate IATI identifiers
unitedstates-998-part-2 1735 3242
unitedstates-sv-part-1 1390 2846
jica-789_2015 2113 3558 Duplicate IATI identifiers
afd-global 1522 2933 Duplicate IATI identifiers
unitedstates-af-part-1 709 2088
unitedstates-co-part-2 1590 2963
unitedstates-pa-part-1 1297 2664
sdc_ch-150410 1366 2687 Duplicate IATI identifiers
unitedstates-998-part-20 1963 3245
sdc_ch-201207 1373 2651 Duplicate IATI identifiers
idrccrdi-act2 1268 2494 Duplicate IATI identifiers
unesco-43104 719 1932 Duplicate IATI identifiers
unitedstates-cr-part-1 1152 2335
sdc_ch-161212 2581 3714 Duplicate IATI identifiers
unitedstates-pk-part-1 1244 2368
sdc_ch-040324 2114 3198 Duplicate IATI identifiers
unitedstates-998-part-21 2188 3245
unitedstates-gt-part-2 2189 3164
unitedstates-ht-part-1 1431 2351
unitedstates-998-part-22 2353 3245
unitedstates-pe-part-2 1995 2874
unitedstates-hn-part-2 1947 2826
unitedstates-co-part-3 2110 2965
unitedstates-998-part-3 2390 3243
unitedstates-mx-part-3 2373 3173
unitedstates-998-part-4 2508 3244
unrwa-all 134 798
unitedstates-af-part-2 1426 2090
jica-389_2014 0 660 Duplicate IATI identifiers
dmru-all 0 631 Points to the same resource as dmru-999
unitedstates-998-part-5 2644 3244
jica-679_2015 1476 1971 Duplicate IATI identifiers
unitedstates-998-part-6 2767 3244
jica-679_2014 1351 1825 Duplicate IATI identifiers
unitedstates-998-part-7 2869 3244
unpf-2015 6049 6421 Duplicate IATI identifiers
pin-dataset-20200930 0 327 Out of memory error
ifrcrcs-act202101 8 327 Duplicate IATI identifiers
ifrcrcs-act202012 6 319 Duplicate IATI identifiers
be-dgd-2021_4 8 312
pin-dataset-20200630 0 303
andylolz commented 3 years ago

Here’s the code I used to generate that:

import iatikit
import requests

discrepancies = []
for d in iatikit.data().datasets:
    success = False
    while success is False:
        try:
            r = requests.get('https://datastore.codeforiati.org/api/1/about/dataset/' + d.name)
            success = r.status_code == 200
            j = r.json()
        except:
            pass
    if j['resources']:
        c1 = j['resources'][0]['num_of_activities']
    else:
        c1 = 0
    c2 = d.activities.count()
    if c1 != c2:
        discrepancies.append((c1, c2, d.name))

discrepancies = sorted(discrepancies, key=lambda x: x[1] - x[0], reverse=True)
for x in discrepancies:
    if x[1] - x[0] > 300:
        print(x)
andylolz commented 3 years ago

I’ve made a start at an "Explanation" column above.

Re-use of IATI identifiers appears to be a common theme:

Publisher Number of duplicate IATI identifiers
JICA 8,736
Switzerland - SDC 8,231
IFRC 6,085
AFD 2,969
Canada – IDRC 1,356
UNESCO 1,213
UN Pooled Funds 390
andylolz commented 3 years ago

Right now, iatikit is reporting 989,421 activities with unique identifiers. Datastore classic is reporting 986,234 activities with unique identifiers. So a difference of just 3,187 activities.

It looks like there was some problem parsing lots of unitedstates datasets, which has now been resolved. The total activities for each of this publisher’s datasets now match up.

It’s a bit difficult to compare accurately, because datastore classic caches results from datasets that are not currently available, whereas IATI Data Dump doesn’t do that (though perhaps it should – there’s a ticket for that here codeforIATI/iati-data-dump#18). So, DS Classic will have some activities that are not in IATI data dump, and IATI data dump may have some activities that are not in DS Classic (e.g. bmz-activities_based_on_recipient_country, which appears to cause memory problems).

There’s also definitely a problem with reporting errors – some stuff ends up in the publicly accessible error logs (/api/1/error/dataset/DATASET/) but not everything. It would be good to make that a bit more consistent.