Open hepcat72 opened 3 months ago
A couple weeks ago, Sven implemented a way to always allow access to the web logs. I just ran into a gateway timeout error and at the end of the tracebase error log, I see this:
Timeout when reading response headers from daemon process 'tracebase': /var/www/tracebase/TraceBase/wsgi.py, referer: https://tracebase-dev.princeton.edu/DataRepo/submission
This appears to be due to C extension packages being run in sub-interpreters (which would explain the randomness of the issue), according to:
The solution is to add this to the httpd.conf file:
WSGIApplicationGroup %{GLOBAL}
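For context, this directive would sit alongside the app's other mod_wsgi directives, roughly like the sketch below. Only the WSGIApplicationGroup line is new; the daemon process name and wsgi.py path are taken from the log line below, while the python-home value and the exact surrounding directives are just my guess at our actual config.

```apache
# Rough sketch - not our actual httpd.conf; only the WSGIApplicationGroup line is new
WSGIDaemonProcess tracebase python-home=/var/www/tracebase/venv
WSGIProcessGroup tracebase
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias / /var/www/tracebase/TraceBase/wsgi.py
```

WSGIApplicationGroup %{GLOBAL} forces the app to run in the main Python interpreter instead of a sub-interpreter, which is what the C-extension advice above is about.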
The full log line on tracebase-dev that alerted me to the cause:
/var/log/httpd/tracebase-error_log
[Sun Sep 01 12:21:00.236470 2024] [wsgi:error] [pid 1807785:tid 1807825] [client 172.20.196.21:60293] Timeout when reading response headers from daemon process 'tracebase': /var/www/tracebase/TraceBase/wsgi.py, referer: https://tracebase-dev.princeton.edu/DataRepo/submission
@jcmatese - I think I found the cause of those timeouts you reported. See above.
I will point out that the timeouts repeatedly (if anecdotally) seem to coincide with me running a load on the command line (i.e. not an apache process).
Sven implemented the above suggestion from ticket 105658. I will see if I can reproduce the timeout on dev and get back to Sven about also implementing on prod.
@jcmatese - Could you please see if you can get your gateway timeout to happen again on dev?
Sven's changes appeared to break tracebase-dev. He reverted those changes. I poked around in the linked stack posts some more. It looks like there could be multiple possible contention problems. The next likely candidate is database access, as this did happen while I was running a load script on tracebase-dev. I'm going to do some more testing to confirm that possibility.
In the meantime @jcmatese - there's no need for you to test anything.
I asked on the Django forum and they said that it could be due to the atomic transaction. To debug, they referred me to a postgresql doc on lock monitoring. Here is the forum post link:
https://forum.djangoproject.com/t/operability-during-long-running-loads/35843/2
I did some testing on this today and learned some things...
I tested 2 scenarios:

1. Concurrent load and validation: running a load on the command line causes the create or get_or_create calls in the second (web) submission to hang/wait until the first transaction releases its model and/or row locks.
2. Concurrent validation submissions: a second web validation submitted while the first is still running likewise causes the create or get_or_create calls in the second submission to hang/wait until the first transaction releases its model and/or row locks.

It seems to me that, given my understanding of atomic transactions from the django docs, it is DB row locks that are likely making subsequent create-related operations hang until the lock is released. Queries are supposed to be able to work during these transactions, and given the SQL output I observed in my testing, a few pure selects did work before it got to the Study record creation. Likely, specific rows can be locked for updates occurring in the transaction, but new row creations are probably blocked as well - or perhaps it specifically has to do with AutoFields, so that you're not skipping numbers in other transactions when the first transaction fails and rolls back? Regardless, the end result is that only 1 validation or load can happen at one time.
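To make the suspected mechanism concrete, here's a minimal sketch of the pattern I think is at play (this is not the actual loader code; the Study import path and function names are placeholders):

```python
from django.db import transaction

from DataRepo.models import Study  # import path assumed


def long_running_load(study_dicts):
    # The loader wraps everything in one big atomic transaction, so any row /
    # uniqueness locks taken by its creates are held until the final commit
    # or rollback.
    with transaction.atomic():
        for rec_dict in study_dicts:
            Study.objects.get_or_create(**rec_dict)
        # ... many more inserts into other models ...
    # Locks are only released here, when the transaction ends.


def concurrent_validation(rec_dict):
    # A validation submission running at the same time appears to block on its
    # own create/get_or_create until the loader's transaction above finishes,
    # which would explain the hung SQL seen in the runserver window.
    with transaction.atomic():
        Study.objects.get_or_create(**rec_dict)
```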
I could not find a way to allow concurrent atomic transactions that do loads.
I did find a way to test whether a lock currently exists...
from django.db import DatabaseError

try:
    Study.objects.select_for_update(nowait=True).get_or_create(**rec_dict)
except DatabaseError:
    # A lock exists that prevents us from creating a record
    pass
I'm not actually sure if the above works on a create or get_or_create (the docs use .filter()), but if it does, this could be used to tell the user to wait and try again later, or to report what position in the queue they're in.
It may also be required that this be executed outside of any atomic transaction...
According to the django docs:
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
A quick solution might be to check whether there's a lock on a row of the study table and, if there is, render a page telling the user that the server is busy and to try again later (see the sketch after these options).
A more complete solution would be to finally install Celery and queue up these jobs, reporting to each user how many requests are ahead of theirs in the queue.
An ultimate solution (which I don't think is possible) would be to tell django that we intend to roll back anyway and just allow the validation to proceed as if it will never be committed, i.e. don't reserve an ID in an AutoField, and we don't care if another process is updating the same record.
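Here is a minimal sketch of the "quick solution" above, built on the same nowait probe. The view and template names are made up, and the probe is wrapped in its own short atomic block since, per the Django docs, select_for_update() can only be evaluated inside a transaction:

```python
from django.db import DatabaseError, transaction
from django.shortcuts import render

from DataRepo.models import Study  # import path assumed


def study_table_locked() -> bool:
    """Return True if another transaction currently holds row locks on Study."""
    try:
        with transaction.atomic():
            # nowait=True makes this raise immediately instead of blocking if
            # any selected row is already locked by another transaction.
            list(Study.objects.select_for_update(nowait=True).all())
        return False
    except DatabaseError:
        return True


def validation_view(request):
    # Hypothetical view/template names, just to show the shape of the check.
    if study_table_locked():
        return render(request, "DataRepo/server_busy.html", status=503)
    ...  # otherwise, proceed with the normal validation/submission handling
```

Note that this probe is crude (it briefly takes row locks on every Study record itself), and per the uncertainty above it is untested whether it actually detects the locks taken by a load in progress.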
BUG DESCRIPTION
Problem
There appears to be an infrequent issue where form submissions on the "start a submission" page randomly hang and encounter a gateway timeout. THIS MAY ONLY HAPPEN DURING A LOAD ON THE COMMAND LINE OR DURING CONCURRENT VALIDATIONS.
Steps to reproduce
Concurrent load and validation:

1. Edit TraceBase/.env to set SQL_LOGGING=True
2. python manage.py runserver
3. python manage.py load_study --infile __a_large_study_doc_like_the_water_study__
4. While the load is running, submit a validation from the "start a submission" page in a browser.

Concurrent validation submissions:

1. Edit TraceBase/.env to set SQL_LOGGING=True
2. python manage.py runserver
3. Submit validations from 2 browser windows at roughly the same time.
Previous Notes:
I have not been able to make it happen on purpose, but it can happen with apparently any submission, regardless of size. A study doc submission that repeatedly takes perhaps 3 seconds can encounter the timeout. It might have something to do with concurrent user submissions or perhaps a browser cache issue (though I seriously doubt it), because it can happen repeatedly, and happened to succeed after restarting the browser.
Perhaps a resource contention or deadlock? The file that I was using when I encountered the timeout was: study_workbook-johns_errors1.xlsx
Current behavior
The Concurrent load and validation case above is the clearest example, because you can see the SQL query that hangs in the runserver window, and you can watch the progress of the load in the other terminal window. When the load finishes, you can observe that the validation process resumes immediately.
In the Concurrent validation submissions case, it's harder to see what is going on, because both jobs are outputting to the same console, but you can time the 2 browser windows and note that the second window takes twice as long as the first one.
Previous Notes:
The page hangs for maybe a minute and then you get a gateway timeout. The log files are too verbose right now (intentionally for (issue-specific) debugging on dev) to find these random occurrences, but I did obtain access to the web logs and could not find anything obvious with a brief glance (hours after the issue occurred) - needs a focussed look. Sven said that memory usage was completely fine, but there was some heavy cpu usage that could have caused the issue, though the timing didn't seem to correspond with the first occurrence (first occurrence was before 12:19pm July 29th 2024) and the spike in the logs was around 2:15pm (that could have corresponded with us encountering the issue while trying to debug it).
It's possible it could be data-specific (although I feel like that's a long shot, since the same file can either take 3 seconds or time out).
It's also possible it could be related to CAS, as the file being submitted was, at one point, selected off the ms data share, which required netid authentication to mount...
Expected behavior
Every validation job should finish without a timeout error, despite concurrent requests.
Suggested Change
There are 2 things to solve here:
There are a number of possible steps to take to deal with the concurrent validation issue:

- Celery: spawn the job in a separate process that reports progress in the form of a waiting page/progress bar (a rough sketch appears at the end of this section).

Previous Notes:
If this is due to concurrent submissions or submissions during a command line (dry-run) load, we may be able to queue submissions and/or temporarily disable validation during loads. We should probably also schedule loads to happen overnight. The issue can also be mitigated by speeding up the load scripts.
~We need to figure out how to reproduce the issue reliably. I was thinking of trying multiple concurrent submissions to see if that can cause it, but I haven't done it yet. Should also watch the log.~
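If we do eventually go the Celery route, a rough sketch of the shape it could take (nothing below exists in TraceBase yet; it assumes a standard Celery app and broker are already configured, and the task/function names are hypothetical):

```python
from celery import shared_task


@shared_task(bind=True)
def validate_submission_task(self, study_file_path):
    # Placeholder for the existing validation logic, run in a worker process
    # so the browser request never waits on the database locks.
    ...


def enqueue_validation(study_file_path):
    """Queue a validation job and return its Celery task id."""
    result = validate_submission_task.delay(study_file_path)
    # A waiting page could poll the task's state (PENDING/STARTED/SUCCESS) to
    # drive a progress bar; reporting "you are Nth in line" would take extra
    # bookkeeping (e.g. recording queued task ids in the database).
    return result.id
```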
Comment
None
ISSUE OWNER SECTION
Assumptions
Limitations
Affected Components
Requirements
DESIGN
GUI Change description
None provided
Code Change Description
None provided
Tests