CottageLabs / LanternPM

Lantern meta repository for product management

Strange issue on upload #88

Closed - emanuil-tolev closed this issue 8 years ago

emanuil-tolev commented 8 years ago

Uploaded to compliance.cottagelabs.com using a recent Firefox on Windows 7 (not my machine, so no version number).

The sheet being uploaded is FULL University Returns 2013-14.csv: https://drive.google.com/a/cottagelabs.com/file/d/0B3sDwcEtNOd6cmtnVDc4QUJRY28/view?usp=sharing

It took a while, then came up with this:

[screenshot: error message shown after upload]

However, I also received an email telling me I had successfully uploaded it, so I clicked on the link in that email and got taken here:

[screenshot: page reached via the email link]

I then immediately received a second email saying I'd successfully uploaded the same spreadsheet, though I only uploaded it once. The emails contain different links, so the uploads must be being treated as separate jobs, but I have no idea how.

Emails below:

---------- Forwarded message ----------
From: "us@cottagelabs.com" <us@cottagelabs.com>
To: [Uploader redacted by ET]
Cc: 
Date: Wed, 20 Jul 2016 15:44:29 +0000
Subject: Job FULL University Returns 2013-14.csv submitted successfully
Hi [Uploader's email address redacted by ET]

Thanks very much for submitting your processing job FULL University Returns 2013-14.csv.

You can track the progress of your job at https://compliance.cottagelabs.com#ZbtQpf4KaEB5nkdpi

The Cottage Labs team

P.S This is an automated email, please do not reply to it.

---------- Forwarded message ----------
From: "us@cottagelabs.com" <us@cottagelabs.com>
To: [Uploader redacted by ET]
Cc: 
Date: Wed, 20 Jul 2016 15:42:45 +0000
Subject: Job FULL University Returns 2013-14.csv submitted successfully
Hi [Uploader's email address redacted by ET]

Thanks very much for submitting your processing job FULL University Returns 2013-14.csv.

You can track the progress of your job at https://compliance.cottagelabs.com#bpkGTQnrFPpwBERXz

The Cottage Labs team

P.S This is an automated email, please do not reply to it.

emanuil-tolev commented 8 years ago

@markmacgillivray I'm not sure what to follow up with here - shall I maybe try to reproduce this on BrowserStack, if you don't have any immediate ideas about what could be causing it? The two job IDs are in the emails above if that would help. I don't think the button can be clicked twice on submission (without refreshing the page), and we probably have to trust that the page didn't get refreshed, at least until we confirm 100% whether we can reproduce this.

markmacgillivray commented 8 years ago

I have checked this myself and I also get the error, although the job does seem to upload, and did seem to create two jobs. I definitely did not click twice. But I did not receive any emails at all, which is even more strange.

The created jobs are identical, both have 2531 processes, although the input spreadsheet has 2557 rows.

@emanuil-tolev have you checked to see if this is the number of results you / Cecy would expect, or should it actually have 2557 (maybe 2556 if top row is headers, I have not checked the count exactly)?

markmacgillivray commented 8 years ago

Is Cecy expecting this one fixed, or is this an additional error? I doubt I am going to have time to debug this one this week.

markmacgillivray commented 8 years ago

Ah I just got the expected emails, so at least there is not an additional complication here.

emanuil-tolev commented 8 years ago

She probably expects it to be fixed, in the sense that it detracts from reliability.

> @emanuil-tolev have you checked to see if this is the number of results you / Cecy would expect, or should it actually have 2557 (maybe 2556 if top row is headers, I have not checked the count exactly)?

No, I've not - I mostly rely on the system itself to read all rows in. So I wouldn't really have any expectations different from what the system does, at least not unless I manually comb through the sheet looking for empty rows. The numbers sound close enough - unless you suspect a parsing problem I wouldn't question them.

markmacgillivray commented 8 years ago

Just confirming in this comment that I will try to fix this ASAP.

emanuil-tolev commented 8 years ago

Some more information about further instances of this bug being encountered:


I also had this issue with the attached spreadsheet: https://compliance.cottagelabs.com#gdRyrQD536En78ASM

I got an error message, so changed the email to [personal email]@gmail.com in case it was related to Lantern and got a longer error message:

Sorry, there has been an error with your submission. Please try again. If you continue go receive an error, please contact us@cottagelabs.com attaching a copy of your file and with the following error information: {"readyState":4,"responseText":"{\n \"status\": \"error\",\n \"data\": {\n \"length\": 382,\n \"quota\": {\n \"admin\": false,\n \"premium\": false,\n \"additional\": 0,\n \"until\": false,\n \"display\": false,\n \"email\": \"[personal email]@gmail.com\",\n \"count\": 0,\n \"max\": 100,\n \"available\": 100,\n \"allowed\": true\n },\n \"info\": \"382 greater than remaining quota 100\"\n }\n}","responseJSON":{"status":"error","data":{"length":382,"quota":{"admin":false,"premium":false,"additional":0,"until":false,"display":false,"email":"[personal email]@gmail.com","count":0,"max":100,"available":100,"allowed":true},"info":"382 greater than remaining quota 100"}},"status":413,"statusText":"Request Entity Too Large"}

I immediately then received two emails to my Wellcome email address telling me my upload was successful.

I stupidly tried to upload the original spreadsheet I had this problem with too (FULL University Returns 2013-14.csv), and ended up with FIVE emails for one submission…

markmacgillivray commented 8 years ago

In the original sheet: 2557 rows including the header row, so 2556 rows of data, of which 25 are blank. So the 2531 processes that get created for the job is the correct amount. 2247 PMCIDs, 2494 PMIDs, 2524 DOIs. These numbers are confirmed by what the UI can extract from the sheet using FileReader.
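For illustration, a minimal sketch of how a browser-side FileReader pass can produce counts like these. This is not the actual Lantern UI code; the identifier patterns, the naive comma split, and the `fileInput` element are assumptions:

```js
// Hypothetical counting pass - NOT Lantern's actual parser. Assumes
// `fileInput` is an <input type="file"> element and that identifiers
// match these simple patterns.
const reader = new FileReader();
reader.onload = function (e) {
  const rows = e.target.result.split(/\r?\n/).slice(1); // drop header row
  let blank = 0, pmcids = 0, pmids = 0, dois = 0;
  rows.forEach(function (line) {
    // a row of only commas/whitespace came from an empty spreadsheet row
    if (line.replace(/,/g, '').trim() === '') { blank++; return; }
    const cells = line.split(','); // naive split; real CSVs need quote handling
    if (cells.some(function (c) { return /^PMC\d+$/i.test(c.trim()); })) pmcids++;
    if (cells.some(function (c) { return /^\d{1,8}$/.test(c.trim()); })) pmids++;
    if (cells.some(function (c) { return /^10\.\d{4,}\//.test(c.trim()); })) dois++;
  });
  console.log(rows.length + ' data rows, ' + blank + ' blank, ' +
              pmcids + ' PMCIDs, ' + pmids + ' PMIDs, ' + dois + ' DOIs');
};
reader.readAsText(fileInput.files[0]);
```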

Quick visual scan of the file for oddities shows:

All the lines have hanging commas at the end, where the sheet the CSV was created from must have had blank columns. But there are no extra empty rows after the last row of content.

Lines 1194 and 1543 contain additional statements after the DOI, within the same comma-delimited column, so those values would not be valid DOIs.

On line 1839 the DOI has a trailing dot, which makes it invalid. There could be others like this - they are hard to spot with a visual scan.

These issues should not cause any problem in parsing the sheet, though. Confirmed via the UI that the sheet is read successfully, and the rows above are read with the extra data stuck in the DOI field.

However, every row of data also has a blank key/value pair. I checked this, and it seems that any time there is a blank column header (e.g. caused by trailing commas from empty columns in the original sheet), a key/value pair with a blank key and blank value is created - but never more than one, because the blank keys match each other. So this is common and cannot be the cause of this problem, although blank keys could be filtered out for neatness in the file parser on the UI. I have added that fix too, but it does not solve this issue.
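For reference, the blank-key filter could look something like this. A sketch only, under the assumption (not shown in this thread) that each parsed row is a plain `{ header: value }` object:

```js
// Sketch of the blank-key filter described above - assumes each parsed CSV
// row is a plain { header: value } object, where a blank header (from
// trailing commas in the sheet) produces an entry under the empty-string key.
function stripBlankKeys(row) {
  const clean = {};
  Object.keys(row).forEach(function (key) {
    if (key.trim() !== '') clean[key] = row[key];
  });
  return clean;
}

// e.g. stripBlankKeys({ DOI: '10.1234/x', '': '' }) returns { DOI: '10.1234/x' }
```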

I turned off the cluster and submitted only to the main machine; the UI eventually threw a 502, although the service did not actually go down. This may be some sort of timeout issue again, although it is not a timeout error. The job did submit - it just did not send a response to the UI, so the user's screen never updated and eventually showed the error.

Running only on the main machine, though, the job processed and I received just one email confirming submission and one confirming completion. This suggests it may be something to do with having the cluster running, in combination with a timeout, that causes multiple jobs to be created and therefore multiple emails to be sent. If a submission times out before returning a response while the job is actually being created, perhaps a second one is getting kicked off somewhere too.

So, progress, but not fixed yet.

The cluster is back on now, and I am going to analyse the process of creating jobs and how they get allocated, and see if I can find what causes the timeout problem.

So, things that could be confirmed to Cecy for now, if you wish @emanuil-tolev: the sheet parses correctly (2531 processes is the right count for it), and the job does submit despite the UI error.

Once I find where the timeout problem comes from, the whole problem should go away.

markmacgillivray commented 8 years ago

Checked with 1000 rows of the full sheet: works OK, with the response taking about 90s. We probably cannot expect submission to be much faster than that. We could change the way the response is sent back to the user, so they don't wait for confirmation on the UI but get it via email, and the UI takes them straight to the job status page - although at that point the job may still not exist, so they would just be on a holding page anyway. Not much difference...

Submission does seem to take longer now than it used to, though. The creation of processes has changed to move some of the logic away from status checking, so creating a large job may be more intensive than it used to be.

(It is not the time taken to actually run the processes - those all already exist in the system - so it is probably the check for whether each one exists yet that is taking the time.)

Checking with 2000 rows...

2000 rows fails, with 4.1 minutes of total processing time. The UI reported an error, but the job was created, and it did send two emails confirming the submission of two jobs. So it seems to be the cluster setup that is picking up the failure and submitting it all again - or a sheet error between rows 1000 and 2000, but that seems less likely.

The nginx timeout settings are at 300s just now. Increasing them to 600s and trying 2000 rows again...
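For reference, these are the standard nginx proxy-timeout directives involved. The actual Lantern config is not shown in this thread, so the upstream name and location here are placeholders:

```nginx
# Illustrative only - placeholder upstream/location, not the real config.
location / {
    proxy_pass            http://lantern_upstream;
    proxy_connect_timeout 600s;
    proxy_send_timeout    600s;
    proxy_read_timeout    600s;  # how long nginx waits on the upstream response
}
```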

Failed after a 2.2-minute wait and created four copies of the job.

So the error is not a timeout but an actual failure of some sort, and it seems to be triggered at different points - unless it has something to do with the content in question. Will try with the last 500-odd rows...

Succeeded in 51s

Trying with first 1000 and last 500...

Failed at 4.1 minutes again - about twice the time of doing the first 1000 and the last 500 individually. Caused two emails and two jobs.

Trying full submission with cluster off but timeouts still set higher...

Hit a 502 after 2 minutes. Still submitted one job successfully, and got one email each for submission and completion.

An nginx IP-hashing timeout? Tried switching to different nginx configs; still got the 502 error after 2 minutes. But I did see some errors in jobs go by:

```
I20160808-16:23:41.523(1) (synced-cron-server.js:63) SyncedCron: Starting "lantern".
I20160808-16:23:41.526(1)? 6TZsf7dZfjKrhyD9k
I20160808-16:23:41.574(1)? http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=DOI:10.1017/S0022149X14000431&resulttype=core&format=json
I20160808-16:24:01.617(1) (synced-cron-server.js:63) SyncedCron: Exception "lantern" Error: getaddrinfo ENOTFOUND
    at Object.Future.wait (/home/cloo/.meteor/packages/meteor-tool/.1.3.2_4.okm7y4++os.linux.x86_64+web.browser+web.cordova/mt-os.linux.x86_64/dev_bundle/server-lib/node_modules/fibers/future.js:420:15)
    at Object.call (packages/meteor/helpers.js:119:1)
    at Object.CLapi.internals.use.europepmc.search (app/apps/clapi/server/endpoints/use/europepmc.js:155:25)
    at Object.CLapi.internals.use.europepmc.doi (app/apps/clapi/server/endpoints/use/europepmc.js:124:43)
    at Object.CLapi.internals.service.lantern.process (app/apps/clapi/server/endpoints/service/lantern.js:694:53)
    at Object.CLapi.internals.service.lantern.nextProcess [as job] (packages/percolate_synced-cron/synced-cron-server.js:218:1)
    at scheduleTimeout (packages/percolate_synced-cron/synced-cron-server.js:266:1)
    at packages/percolate_synced-cron/synced-cron-server.js:314:1
```

Because it appears the EuropePMC EBI address is totally down!

emanuil-tolev commented 8 years ago

We could display some additional text letting users know it could take much longer when there are over 100-200 records in the sheet, in addition to the submission button saying "Submitting, please wait". The UI knows the record count.
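Something like this, perhaps. A sketch only - the wording and the 100-record threshold are placeholders, not Lantern's actual UI code:

```js
// Hedged sketch of the suggestion above: the record count is already known
// from the FileReader pass, so the submit notice can warn about long waits
// up front. The threshold and wording are made up for illustration.
function submissionNotice(recordCount) {
  if (recordCount > 100) {
    return 'Submitting ' + recordCount + ' records - large sheets can take ' +
           'several minutes to confirm, please keep this page open.';
  }
  return 'Submitting, please wait...';
}

// e.g. submissionNotice(2531) returns the multi-minute warning
```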

markmacgillivray commented 8 years ago

The only possibility left is that it has to do with hangs somewhere between the cluster and the machine issuing the request. Given all the layers, it may be impossible to find (at least within a reasonable timeframe and effort). Instead, I will make a POST to the job endpoint return a job ID that can be used immediately to check job status, with the first few checks returning "job not found" or "job still compiling" or something of that nature. This should ensure that a response is received promptly by the UI, and that only one email is sent, as expected.
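A minimal sketch of that flow, using an Express-style server for illustration (Lantern itself is Meteor, and the in-memory job store here is a stand-in):

```js
// Minimal sketch of the intended flow - NOT the actual Lantern/Meteor code.
// The Express routing and the in-memory job store are illustrative only.
const express = require('express');
const app = express();
app.use(express.json());

const jobs = new Map(); // stand-in for the real job store

app.post('/service/lantern', function (req, res) {
  // register a job stub and hand back its ID straight away
  const jobId = Math.random().toString(36).slice(2, 12);
  jobs.set(jobId, { loaded: false, progress: 0 });
  // build the (potentially thousands of) processes in the background
  setImmediate(function () {
    // ...slow process creation would happen here...
    jobs.get(jobId).loaded = true;
  });
  res.json({ status: 'success', data: { job: jobId } });
});

app.get('/service/lantern/:id/progress', function (req, res) {
  const job = jobs.get(req.params.id);
  if (!job) return res.json({ status: 'error', info: 'job not found' });
  // while still compiling, report zero progress rather than hanging the UI
  res.json({ status: 'success', data: { new: !job.loaded, progress: job.progress } });
});

app.listen(3000);
```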

@richard-jones just pinging you because this is going to affect the API docs a little bit - the POST of job creation may have a different response object. Will let you know once done.

emanuil-tolev commented 8 years ago

Some original feedback (not paraphrased by me like the above), in case it helps debug the "hangs somewhere between cluster and machine issuing the request". This is from before you did the final fixes on #85 (Wellcome vs Lantern quota).

@emanuil-tolev says:

  1. Strange issue duplicating jobs on upload - we've confirmed and reproduced this. You upload a sheet (or at least the sheet "FULL University Returns 2013-14.csv") and it creates two jobs without you clicking Submit twice. We're in the middle of debugging it.

I also had this issue with the attached spreadsheet: https://compliance.cottagelabs.com#gdRyrQD536En78ASM

I got an error message, so changed the email to [email omitted] in case it was related to Lantern and got a longer error message:

Sorry, there has been an error with your submission. Please try again. If you continue go receive an error, please contact us@cottagelabs.com attaching a copy of your file and with the following error information: {"readyState":4,"responseText":"{\n \"status\": \"error\",\n \"data\": {\n \"length\": 382,\n \"quota\": {\n \"admin\": false,\n \"premium\": false,\n \"additional\": 0,\n \"until\": false,\n \"display\": false,\n \"email\": \"[email omitted]\",\n \"count\": 0,\n \"max\": 100,\n \"available\": 100,\n \"allowed\": true\n },\n \"info\": \"382 greater than remaining quota 100\"\n }\n}","responseJSON":{"status":"error","data":{"length":382,"quota":{"admin":false,"premium":false,"additional":0,"until":false,"display":false,"email":"[email omitted]","count":0,"max":100,"available":100,"allowed":true},"info":"382 greater than remaining quota 100"}},"status":413,"statusText":"Request Entity Too Large"}

I immediately then received two emails to my Wellcome email address telling me my upload was successful.

I stupidly tried to upload the original spreadsheet I had this problem with too (FULL University Returns 2013-14.csv), and ended up with FIVE emails for one submission…

markmacgillivray commented 8 years ago

Fixed as described above: the POST to job now returns immediately, and the progress endpoint has a new key called "new". While it is true, the job is still actually loading, so of course progress will always be 0 at this stage. A job of 3000 rows (the longest single job allowed) can take about 4 minutes to load, depending on the complexity of the records provided. "new" is set to false on the progress endpoint once the job is actually loaded and processing can begin.
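For illustration, a progress response for a still-loading job might look like this. Only the "new" key and the zero progress are confirmed above; the status/data envelope is an assumption based on the error response quoted earlier in this thread:

```json
{
  "status": "success",
  "data": {
    "new": true,
    "progress": 0
  }
}
```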

@richard-jones - the changes to the API docs you may want to add are just that the progress endpoint now includes the "new": true/false key, and that a POST of a new job now returns immediately (same data as before).

markmacgillivray commented 8 years ago

I also updated the UIs to display a "this job is new, please wait before expecting progress to change from 0" message on the progress page, for as long as the job is new.