IATI / datastore-search

Browser application for searching the IATI Datastore via its API
https://datastore.iatistandard.org/
GNU Affero General Public License v3.0

Not able to download data #665

Open siemvaessen opened 9 months ago

siemvaessen commented 9 months ago

Brief Description I just tried downloading IATI data from https://datastore.iatistandard.org/?q=* and it was not possible (on the activity endpoint, the one I tried) in any of the formats offered by the download option.

Severity High

Issue Location Datastore issue URI

Steps to Reproduce

  1. Go to 'https://datastore.iatistandard.org/?q=*'
  2. Click on 'Download Data'
  3. Scroll down to 'Pick format' and choose any format
  4. See JavaScript error.

Expected Results/Behaviour I would expect it to start a download in any given format

Actual Results/Behaviour

(Screenshot 2023-12-13 at 12 33 09)

odscjames commented 9 months ago

Thanks for the report.

I guess there is a practical limit on how much data can be downloaded directly from the web UI in one go. We should highlight that in the web UI and point to alternative options instead.

Speaking of which: is our API suitable for this? It is described at https://developer.iatistandard.org/ and allows paging.

However, if the main need is just to get all the data in one download, then other tools may be more suitable. If you or anyone else reading this wants help, please contact support: https://iatistandard.org/en/guidance/get-support/
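
As a rough illustration of what paging through results could look like, here is a minimal Python sketch. It assumes Solr-style `start` and `rows` parameters, which is an assumption on my part and not something confirmed in this thread; `page_windows`, `fetch_all` and the `fetch_page` callable are hypothetical names for illustration, so please check https://developer.iatistandard.org/ for the real endpoints and parameters.

```python
from typing import Callable, Dict, Iterator, List


def page_windows(total: int, rows: int) -> Iterator[Dict[str, int]]:
    """Yield Solr-style paging parameters (assumed: start/rows) covering `total` results."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    for start in range(0, total, rows):
        # The last window is trimmed so we never ask for more than remains.
        yield {"start": start, "rows": min(rows, total - start)}


def fetch_all(
    fetch_page: Callable[[int, int], List[dict]],
    total: int,
    rows: int = 1000,
) -> List[dict]:
    """Collect every page by calling the caller-supplied fetch_page(start, rows).

    fetch_page is hypothetical: it stands in for whatever HTTP call
    the real API requires (URL, authentication, response parsing).
    """
    docs: List[dict] = []
    for window in page_windows(total, rows):
        docs.extend(fetch_page(window["start"], window["rows"]))
    return docs
```

Because each request only asks for one window of results, no single request has to prepare the whole result set in one go.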

siemvaessen commented 9 months ago

I would think that if the interface offers the option to download a response in any of the given formats, it should allow that to happen and not direct users to some other tool. I would expect the results of any search to be downloadable in those formats. Apache Solr is powering this datastore, correct?

odscjames commented 9 months ago

Yes, it is Solr.

However, for various reasons there may be an upper limit on how much data can be downloaded directly from the website, and this is OK: the website is meant for exploratory use, to see what is possible. Other tools or the API may be more suitable depending on what people are trying to achieve. We are reviewing this tool, other tools and how users discover and work across tools, and we can review this at the same time.

But whatever happens, the UI should reflect what the user can do and should never just drop people at an unhelpful error. We'll use this ticket to fix that error message and make sure people have a better user experience. Thanks.

siemvaessen commented 3 months ago

Just a follow-up - I am searching for food security. I get 149839 results and would like to download them as a CSV file for the 'transaction' core, but the download never starts; it just keeps spinning without producing results. Is there a maximum size set somewhere that prevents downloads of large result sets (for a specific core)? I am trying to compare some data. Thanks.

simon-20 commented 2 months ago

Hi @siemvaessen,

Thanks for prompting us further on this.

I performed the search you listed above and it said there were 159178 activities.

I was able to download these as XML. It's worth noting that the site took 4.5 minutes to prepare the download, and the cursor spins for this entire time. This many activities resulted in a 2.8 GB file, which then took a further few minutes to download.

I also tried it as CSV, on the transaction core. Here I encountered the problem as you described it above. The immediate issue is not the amount of data being prepared for download, but the time it takes to prepare it. (You can see a little of what is going on by opening Developer Tools in Chrome (or equivalent) and looking at the Network tab.) This particular download takes longer than 10 minutes to prepare. As things are currently set up, there is a 10-minute timeout on download preparation, so this particular download request will fail.

An interim solution would be for you to split this search into two (or maybe three), and download each separately.

I will look into the various options we have to address this issue. It is likely, though, that some limit will have to be imposed for CSV downloads. Because CSV is a flat format, flattening hierarchical data can make the number of rows explode, since the same item or the same data has to be repeated many times: e.g. if an activity has a hundred transactions, the details of the activity have to be included in each CSV row, rather than just once as with XML or JSON. This is why one is more likely to encounter this timeout when requesting results as CSV.
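
To make the flattening effect concrete, here is a small Python sketch. The field names (`iati_identifier`, `title`, `transactions`, `value`) and the function `flatten_to_rows` are made up for illustration and are not the Datastore's actual schema or code; the point is only that one nested activity becomes one row per transaction, each repeating the activity details.

```python
from typing import Dict, List


def flatten_to_rows(activity: dict) -> List[Dict[str, str]]:
    """Turn one nested activity into one flat row per transaction."""
    # Everything except the nested list is activity-level detail.
    activity_fields = {k: v for k, v in activity.items() if k != "transactions"}
    rows = []
    for transaction in activity["transactions"]:
        row = dict(activity_fields)  # activity details copied into every row
        for key, value in transaction.items():
            row[f"transaction_{key}"] = value
        rows.append(row)
    return rows


# Hypothetical activity with a hundred transactions.
activity = {
    "iati_identifier": "XX-EXAMPLE-1",
    "title": "Example activity",
    "transactions": [{"value": str(i)} for i in range(100)],
}

rows = flatten_to_rows(activity)
repeats = sum(1 for row in rows if row["title"] == "Example activity")
print(len(rows), repeats)  # one nested record becomes 100 rows, title stored 100 times
```

In the nested form the title is stored once; in the flat form it is stored a hundred times, and the same multiplication applies to every other activity-level field.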

But I'll look into possible ways forward, that might make more searches and downloads possible.

Simon