Closed PetyaKangalova closed 6 years ago
We should use a background job for this: http://docs.ckan.org/en/latest/maintaining/background-tasks.html which basically means using celery. We can put the job in the queue: http://docs.ckan.org/en/latest/extensions/plugins-toolkit.html#ckan.plugins.toolkit.ckan.plugins.toolkit.enqueue_job and after it finishes it would take some action (i.e. notify the user that the job is finished). The biggest challenge is how we can integrate celery in our current infrastructure using Docker/Deis.
Thanks for looking into this @visar - @chrisviderum and @pmartignoli mentioned that this issue arises only for large CSV files.
Do we know how large (by either filesize or number of rows of data) that the file needs to be in order to trigger this issue?
@dalepotter The csv itself might not be very large, but I think the way the uploader works is that it goes through each line of the csv and it imports as a dataset, thus hitting the timeout limit. So the issue occurs if the csv file has a lot of lines (usually hundreds of lines).
@visar As you mentioned, the uploader does go through each line in a CSV file and updates the dataset specified on each row. Also within this loop is code such as line 219 that will be picked up a linter as problematic.
Looking at the tickets @PetyaKangalova mentioned, the number of Datasets each publisher has is:
With rows mapping 1:1 to Datasets, this is very much not, as you stated, "hundreds of lines", and the maximum Datasets that any publisher has is 258 for the United States.
Thanks @visar and @hayfield for providing extra clarity on this problem. It was good to look through the code that @hayfield linked to. Whilst this does look like a reasonably straightforward loop, I can see how, for 258 datasets, making 258 requests to the CKAN api could cause a timeout.
However, I'm mindful to ensure that we have fully explored other alternatives before we add celery, particularly given the time estimates for this. Can I ask if it is possible to send multiple datasets to the CKAN API with one request?
@PetyaKangalova - I presume that no further publishers (other than those that Hayden listed above) have encountered this issue? Do we know how frequently they experience this issue?
@dalepotter @visar We haven't been contacted by other publishers encountering an error with the CSV upload so there are no additional ones to the one Hayden has provided. The frequency for publishers using the CSV upload is not very high as there are not that many publishers with large datasets.
This week Amy has been in touch with an organisation that is currently preparing to publish to IATI and will be using the CSV import functionality. She will post in here if they encounter the Gateway Timeout error as well.
@pmartignoli Just to flag in here as well that another publisher encountered the same gateway timeout error while using the CSV upload. The publisher is the World Health Organisation- link to their Registry page in here.
@PetyaKangalova @dalepotter
Following further investigation we have the same conclusion, that a background job using celery should be created to resolve this gateway timeout issue, with an estimated development and implementation time of 36 hours.
The upload time required is not solely dependant on the number of lines, but other possible factors such as traffic levels etc.
@pmartignoli As mentioned in the call this morning, let's go ahead with adding a background job, and use some of the time we have allocated for June for this.
@dalepotter, we are happy to go ahead with this, however it would make sense to carry out following the upgrade to CKAN 2.6.2. https://github.com/ViderumGlobal/ckanext-iati/issues/126
Development for the background job is already underway and will be implemented after the CKAN 2.6.2 upgrade is done.
@newkdukem in an update call, you mentioned that you were looking for a sample (valid and invalid) CSV file for testing.
@pmartignoli mentioned that he has one of these already and you would send this to @newkdukem direct. From this valid version, changes could be made to make it invalid, and used for testing later.
@dalepotter thank you for the info. Also, the rework of the CSV upload is under way. Expected deployment on the staging portal is expected before the end of the week.
The CSV upload changes are been made and deployed and tested on the staging portal. The celery version is tested with a file that previously gave "Gateway Timeout Error". Edge cases that are covered and tested:
All of the mentioned edge cases will cause the csv upload to abort the process.
I've just tested for the first time with this csv file and I get this error:
Error opening CSV file: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Is this expected?
@dalepotter did you created the csv file on Mac OS? When saving a csv file on Mac OS, Windows etc, there is additional carriage that is added at the end of the file which is not visible if you just open the file. The error that you are getting is because of this additional carriage in the csv file that you are trying to test. We've already handled Windows files but we didn't noticed that this can happen with MAC OS csv files. We are creating a patch that is going to handle cases like this one, which should be on the staging portal by tomorrow. You can still test out the csv upload feature on the staging portal, with a correctly formatted CSV file, and let us know if it works as expected? We'll let you know when the patch is deployed on the staging portal, so we can make sure that this error doesn't happen again.
@JGulic Thanks for the clarification. Yes the file was created using OS X Yosemite (v10.10.5), using LibreOffice.
Great that you have a fix in progress, though given that Macs have between 5-10% market share I would have thought that a CSV file created by this operating system would be an expected use case in the beginning.
In any case, is that this not that type of thing that can be handled automatically by the the CSV library/module that are you using?
The file in question has CR line endings, which is the quite old Mac way of doing things (these days, OSX uses LF line endings; windows uses CRLF). I’m not sure why LibreOffice generated that.
I would say:
I’m not terribly sure how to make this work in this case. CKAN uses Pylons, which is using cgi.FieldStorage
here. I’m not sure how to get that to do the equivalent of universal newline mode.
@dalepotter the CSV upload has been pushed to production and tested out. The fix for the file that you've tested is missing on production, but we'll deploy that soon. We've tested this out and we didn't encountered any problems. You can test it and let us know if everything works as expected.
Following testing this week with @PetyaKangalova this issue is regarded as fixed.
There is however a related PR #171 for small tweaks to user-facing messaging.
A number of users of the IATI Registry have experienced issues with the CSV upload feature and tickets have been raised seperately (#4272, #4352, #4429, #4439, #4928). While in most cases the CSV upload has worked and publishers were able to upload the files, the Gateway Timeout error remains. It will be useful to get any updates in here on how the investigation into this error is going and also confirmation once a fix of the bug has been found.