codeforIATI / iati-ideas

💡 Ideas for new codeforIATI projects and blogs
https://ideas.codeforiati.org
0 stars 0 forks source link

[PROJECT IDEA] Flattened document-link #26

Open stevieflow opened 3 years ago

stevieflow commented 3 years ago

Rationale

There are thousands of document-links available via IATI data. These all have a range of metadata (both via the document-link directly, but also the parent iati-activity), which might be useful for people to access, for a variety of reasons

Proposal

A service (perhaps built off Classic + iati-flattener: https://github.com/iati-data-access/iati-flattener/issues/1?) to output a list of document-link items. This would pull in relevant data from both the specific document-link element, but also elements such as reporting-org (name; ref; type); recipient-country; sector(s); activity-status.

Users could query and get this list in spreadsheet, JSON and XML format

notshi commented 3 years ago

Hi @stevieflow, there are 1,966,749 document-links published across all activities, of which 392,533 are unique urls so this is a massive query.

This started from me wanting to find out if this was possible via dquery so hope it's ok to share!

Getting a list of flattened document-links is expensive but doable. Here is the first 100 results of flattened document-link only,'%20')%20%20)as%20category,%0Amax(xson%20-%3E%20'/language'-%3E0-%3E%3E'@code')%20as%20lang,%0Amax(xson%20-%3E%3E%20'/document-date@iso-date')%20as%20isodate,%0Amax(xson%20-%3E%3E%20'@format')%20as%20file,%0Axson%20-%3E%3E%20'@url'%20as%20url%0A%0Afrom%20xson%20where%20root='/iati-activities/iati-activity/document-link'%0A%0Agroup%20by%20url%0Aorder%20by%20url%0A%0Alimit%20100%20offset%200;%0A).

However, including iati-activity level elements slowed the query down to the point that I don't think is practical. In fact, we did some tests and it was not possible to download all the data so much so that it would probably be better to run code over all the data and bypass a database entirely.

Here is the first 100 results including iati-acitivity elements,'%20')%20%20as%20category,%0Ax1.xson%20-%3E%20'/language'-%3E0-%3E%3E'@code'%20as%20lang,%0Ax1.xson%20-%3E%3E%20'/document-date@iso-date'%20as%20isodate,%0Ax1.xson%20-%3E%3E%20'@format'%20as%20file,%0Ax1.xson%20-%3E%3E%20'@url'%20as%20url,%0A%0Ax2.xson%20-%3E%20'/recipient-country'-%3E0-%3E%3E'@code'%20as%20recipient,%0Ax2.xson%20-%3E%20'/recipient-country'-%3E0-%3E%3E'@percentage'%20as%20recipient_percentage,%0Ax2.xson%20-%3E%20'/sector'-%3E0-%3E%3E'@code'%20as%20sector,%0Ax2.xson%20-%3E%20'/sector'-%3E0-%3E%3E'@vocabulary'%20as%20sector_vocab,%0Ax2.xson%20-%3E%20'/sector'-%3E0-%3E%3E'@percentage'%20as%20sector_percentage,%0Ax2.xson%20-%3E%3E%20'/activity-status@code'%20as%20status%0A%0Afrom%20xson%20as%20x1,%0Axson%20as%20x2%20%0A%0Awhere%20x1.root='/iati-activities/iati-activity/document-link'%20AND%0Ax2.root='/iati-activities/iati-activity'%20AND%0Ax1.aid=x2.aid%0A%0Alimit%20100;). It might be better to 'Download XSON' as various formats without running the query in the browser as it could slow your browser!

We found that scaling this query doesn't work and downloads time out as there's too much data to comb through. Simplifying the query, however, makes it feasible and much less stress on the database.

Here is the first 10000 results for a much simpler document-link url query. Changing the offset should allow you to page through the results.

There are probably much more efficient ways to write the query code but this was a first pass :)

notshi commented 3 years ago

So we looked at optimising the query and this seems to work.

Here is the first 100,000 activities,'%20')%20%20as%20category,%0Ax1%20-%3E%20'/language'-%3E0-%3E%3E'@code'%20as%20lang,%0Ax1%20-%3E%3E%20'/document-date@iso-date'%20as%20isodate,%0Ax1%20-%3E%3E%20'@format'%20as%20file,%0Ax1%20-%3E%3E%20'@url'%20as%20url%0A%0A%0Afrom%20xson%20as%20x2%20,%20jsonb_array_elements(x2.xson%20-%3E%20'/document-link'%20)%20%20as%20x1%0A%0Awhere%20x2.root='/iati-activities/iati-activity'%20and%20x2.aid%20in%0A(%0Aselect%20aid%20from%20act%20order%20by%20aid%20limit%20100000%20offset%200%0A)%0A). You'll need to run that query 10 times by changing the offset to get all the data as this looks through all 1 million activities in IATI.

Again, you don't have to run the query, just click on 'Download XSON' to download the data in various formats. It seems the time to download varies depending on the data of that particular page.