IATI / ckanext-iati

CKAN extension for the IATI Registry
http://iatiregistry.org
9 stars 6 forks source link

CSV upload causing Gateway Timeout Error #115

Closed PetyaKangalova closed 6 years ago

PetyaKangalova commented 7 years ago

A number of users of the IATI Registry have experienced issues with the CSV upload feature and tickets have been raised seperately (#4272, #4352, #4429, #4439, #4928). While in most cases the CSV upload has worked and publishers were able to upload the files, the Gateway Timeout error remains. It will be useful to get any updates in here on how the investigation into this error is going and also confirmation once a fix of the bug has been found.

visar commented 7 years ago

We should use a background job for this: http://docs.ckan.org/en/latest/maintaining/background-tasks.html which basically means using celery. We can put the job in the queue: http://docs.ckan.org/en/latest/extensions/plugins-toolkit.html#ckan.plugins.toolkit.ckan.plugins.toolkit.enqueue_job and after it finishes it would take some action (i.e. notify the user that the job is finished). The biggest challenge is how we can integrate celery in our current infrastructure using Docker/Deis.

dalepotter commented 7 years ago

Thanks for looking into this @visar - @chrisviderum and @pmartignoli mentioned that this issue arises only for large CSV files.

Do we know how large (by either filesize or number of rows of data) that the file needs to be in order to trigger this issue?

visar commented 7 years ago

@dalepotter The csv itself might not be very large, but I think the way the uploader works is that it goes through each line of the csv and it imports as a dataset, thus hitting the timeout limit. So the issue occurs if the csv file has a lot of lines (usually hundreds of lines).

hayfield commented 7 years ago

@visar As you mentioned, the uploader does go through each line in a CSV file and updates the dataset specified on each row. Also within this loop is code such as line 219 that will be picked up a linter as problematic.

Looking at the tickets @PetyaKangalova mentioned, the number of Datasets each publisher has is:

With rows mapping 1:1 to Datasets, this is very much not, as you stated, "hundreds of lines", and the maximum Datasets that any publisher has is 258 for the United States.

dalepotter commented 7 years ago

Thanks @visar and @hayfield for providing extra clarity on this problem. It was good to look through the code that @hayfield linked to. Whilst this does look like a reasonably straightforward loop, I can see how, for 258 datasets, making 258 requests to the CKAN api could cause a timeout.

However, I'm mindful to ensure that we have fully explored other alternatives before we add celery, particularly given the time estimates for this. Can I ask if it is possible to send multiple datasets to the CKAN API with one request?

@PetyaKangalova - I presume that no further publishers (other than those that Hayden listed above) have encountered this issue? Do we know how frequently they experience this issue?

PetyaKangalova commented 7 years ago

@dalepotter @visar We haven't been contacted by other publishers encountering an error with the CSV upload so there are no additional ones to the one Hayden has provided. The frequency for publishers using the CSV upload is not very high as there are not that many publishers with large datasets.

This week Amy has been in touch with an organisation that is currently preparing to publish to IATI and will be using the CSV import functionality. She will post in here if they encounter the Gateway Timeout error as well.

PetyaKangalova commented 7 years ago

@pmartignoli Just to flag in here as well that another publisher encountered the same gateway timeout error while using the CSV upload. The publisher is the World Health Organisation- link to their Registry page in here.

pmartignoli commented 7 years ago

@PetyaKangalova @dalepotter

Following further investigation we have the same conclusion, that a background job using celery should be created to resolve this gateway timeout issue, with an estimated development and implementation time of 36 hours.

The upload time required is not solely dependant on the number of lines, but other possible factors such as traffic levels etc.

dalepotter commented 7 years ago

@pmartignoli As mentioned in the call this morning, let's go ahead with adding a background job, and use some of the time we have allocated for June for this.

pmartignoli commented 7 years ago

@dalepotter, we are happy to go ahead with this, however it would make sense to carry out following the upgrade to CKAN 2.6.2. https://github.com/ViderumGlobal/ckanext-iati/issues/126

newkdukem commented 7 years ago

Development for the background job is already underway and will be implemented after the CKAN 2.6.2 upgrade is done.

dalepotter commented 7 years ago

@newkdukem in an update call, you mentioned that you were looking for a sample (valid and invalid) CSV file for testing.

@pmartignoli mentioned that he has one of these already and you would send this to @newkdukem direct. From this valid version, changes could be made to make it invalid, and used for testing later.

newkdukem commented 7 years ago

@dalepotter thank you for the info. Also, the rework of the CSV upload is under way. Expected deployment on the staging portal is expected before the end of the week.

newkdukem commented 7 years ago

The CSV upload changes are been made and deployed and tested on the staging portal. The celery version is tested with a file that previously gave "Gateway Timeout Error". Edge cases that are covered and tested:

All of the mentioned edge cases will cause the csv upload to abort the process.

dalepotter commented 7 years ago

I've just tested for the first time with this csv file and I get this error:

Error opening CSV file: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Is this expected?

JGulic commented 7 years ago

@dalepotter did you created the csv file on Mac OS? When saving a csv file on Mac OS, Windows etc, there is additional carriage that is added at the end of the file which is not visible if you just open the file. The error that you are getting is because of this additional carriage in the csv file that you are trying to test. We've already handled Windows files but we didn't noticed that this can happen with MAC OS csv files. We are creating a patch that is going to handle cases like this one, which should be on the staging portal by tomorrow. You can still test out the csv upload feature on the staging portal, with a correctly formatted CSV file, and let us know if it works as expected? We'll let you know when the patch is deployed on the staging portal, so we can make sure that this error doesn't happen again.

dalepotter commented 7 years ago

@JGulic Thanks for the clarification. Yes the file was created using OS X Yosemite (v10.10.5), using LibreOffice.

Great that you have a fix in progress, though given that Macs have between 5-10% market share I would have thought that a CSV file created by this operating system would be an expected use case in the beginning.

In any case, is that this not that type of thing that can be handled automatically by the the CSV library/module that are you using?

andylolz commented 7 years ago

The file in question has CR line endings, which is the quite old Mac way of doing things (these days, OSX uses LF line endings; windows uses CRLF). I’m not sure why LibreOffice generated that.

I would say:

  1. This is probably pretty edge case
  2. It’s arguable whether this should be considered valid CSV at all, given RFC 4180 suggests CRLF are the only valid line endings for CSV.

I’m not terribly sure how to make this work in this case. CKAN uses Pylons, which is using cgi.FieldStorage here. I’m not sure how to get that to do the equivalent of universal newline mode.

JGulic commented 7 years ago

@dalepotter the CSV upload has been pushed to production and tested out. The fix for the file that you've tested is missing on production, but we'll deploy that soon. We've tested this out and we didn't encountered any problems. You can test it and let us know if everything works as expected.

dalepotter commented 6 years ago

Following testing this week with @PetyaKangalova this issue is regarded as fixed.

There is however a related PR #171 for small tweaks to user-facing messaging.