Open hepcat72 opened 3 months ago
A couple weeks ago, Sven implemented a way to always allow access to the web logs. I just ran into a gateway timeout error and at the end of the tracebase error log, I see this:
Timeout when reading response headers from daemon process 'tracebase': /var/www/tracebase/TraceBase/wsgi.py, referer: https://tracebase-dev.princeton.edu/DataRepo/submission
This appears to be due to C extension packages being run in sub-interpreters (which would explain the randomness of the issue), according to:
The solution is to add this to the httpd.conf file:
WSGIApplicationGroup %{GLOBAL}
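For context, this directive would sit alongside the app's other mod_wsgi directives, roughly like the sketch below. Only the WSGIApplicationGroup line is new; the daemon process name and wsgi.py path are taken from the log line below, while the python-home value and the exact surrounding directives are just my guess at our actual config.

```apache
# Rough sketch - not our actual httpd.conf; only the WSGIApplicationGroup line is new
WSGIDaemonProcess tracebase python-home=/var/www/tracebase/venv
WSGIProcessGroup tracebase
WSGIApplicationGroup %{GLOBAL}
WSGIScriptAlias / /var/www/tracebase/TraceBase/wsgi.py
```

WSGIApplicationGroup %{GLOBAL} forces the app to run in the main Python interpreter instead of a sub-interpreter, which is what the C-extension advice above is about.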
The full log line on tracebase-dev that alerted me to the cause:
/var/log/httpd/tracebase-error_log
[Sun Sep 01 12:21:00.236470 2024] [wsgi:error] [pid 1807785:tid 1807825] [client 172.20.196.21:60293] Timeout when reading response headers from daemon process 'tracebase': /var/www/tracebase/TraceBase/wsgi.py, referer: https://tracebase-dev.princeton.edu/DataRepo/submission
@jcmatese - I think I found the cause of those timeouts you reported. See above.
I will point out that the timeouts repeatedly (if anecdotally) seem to coincide with me running a load on the command line (i.e. not an apache process).
Sven implemented the above suggestion from ticket 105658. I will see if I can reproduce the timeout on dev and get back to Sven about also implementing on prod.
@jcmatese - Could you please see if you can get your gateway timeout to happen again on dev?
Sven's changes appeared to break tracebase-dev. He reverted those changes. I poked around in the linked stack posts some more. It looks like there could be multiple possible contention problems. The next likely candidate is database access, as this did happen while I was running a load script on tracebase-dev. I'm going to do some more testing to confirm that possibility.
In the meantime @jcmatese - there's no need for you to test anything.
I asked on the Django forum and they said that it could be due to the atomic transaction. To debug, they referred me to a postgresql doc on lock monitoring. Here is the forum post link:
https://forum.djangoproject.com/t/operability-during-long-running-loads/35843/2
I did some testing on this today and learned some things...
I tested 2 scenarios:

1. Concurrent load and validation: running a load on the command line causes the create or get_or_create calls in the second (web) submission to hang/wait until the first transaction releases its model and/or row locks.
2. Concurrent validation submissions: a second web validation submitted while the first is still running likewise causes the create or get_or_create calls in the second submission to hang/wait until the first transaction releases its model and/or row locks.

It seems to me that, given my understanding of atomic transactions from the django docs, it is DB row locks that are likely making subsequent create-related operations hang until the lock is released. Queries are supposed to be able to work during these transactions, and given the SQL output I observed in my testing, a few pure selects did work before it got to the Study record creation. Likely, specific rows can be locked for updates occurring in the transaction, but new row creations are probably blocked as well - or perhaps it specifically has to do with AutoFields, so that you're not skipping numbers in other transactions when the first transaction fails and rolls back? Regardless, the end result is that only 1 validation or load can happen at one time.
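To make the suspected mechanism concrete, here's a minimal sketch of the pattern I think is at play (this is not the actual loader code; the Study import path and function names are placeholders):

```python
from django.db import transaction

from DataRepo.models import Study  # import path assumed


def long_running_load(study_dicts):
    # The loader wraps everything in one big atomic transaction, so any row /
    # uniqueness locks taken by its creates are held until the final commit
    # or rollback.
    with transaction.atomic():
        for rec_dict in study_dicts:
            Study.objects.get_or_create(**rec_dict)
        # ... many more inserts into other models ...
    # Locks are only released here, when the transaction ends.


def concurrent_validation(rec_dict):
    # A validation submission running at the same time appears to block on its
    # own create/get_or_create until the loader's transaction above finishes,
    # which would explain the hung SQL seen in the runserver window.
    with transaction.atomic():
        Study.objects.get_or_create(**rec_dict)
```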
I could not find a way to allow concurrent atomic transactions that do loads.
I did find a way to test whether a lock currently exists...
from django.db import DatabaseError

try:
    Study.objects.select_for_update(nowait=True).get_or_create(**rec_dict)
except DatabaseError:
    # A lock exists that prevents us from creating a record
    pass
I'm not actually sure if the above works on a create or get_or_create (the docs use .filter()), but if it does, this could be used to tell the user to wait and try again later, or to report what position in the queue they're in.
It may also be required that this be executed outside of any atomic transaction...
According to the django docs:
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released. If this is not the behavior you want, call select_for_update(nowait=True). This will make the call non-blocking. If a conflicting lock is already acquired by another transaction, DatabaseError will be raised when the queryset is evaluated.
A quick solution might be to check whether there's a lock on a row of the study table and, if there is, render a page telling the user that the server is busy and to try again later (see the sketch after these options).
A more complete solution would be to finally install Celery and queue up these jobs, reporting to each user how many requests are ahead of theirs in the queue.
An ultimate solution (which I don't think is possible) would be to tell django that we intend to roll back anyway and just allow the validation to proceed as if it will never be committed, i.e. don't reserve an ID in an AutoField, and we don't care if another process is updating the same record.
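Here is a minimal sketch of the "quick solution" above, built on the same nowait probe. The view and template names are made up, and the probe is wrapped in its own short atomic block since, per the Django docs, select_for_update() can only be evaluated inside a transaction:

```python
from django.db import DatabaseError, transaction
from django.shortcuts import render

from DataRepo.models import Study  # import path assumed


def study_table_locked() -> bool:
    """Return True if another transaction currently holds row locks on Study."""
    try:
        with transaction.atomic():
            # nowait=True makes this raise immediately instead of blocking if
            # any selected row is already locked by another transaction.
            list(Study.objects.select_for_update(nowait=True).all())
        return False
    except DatabaseError:
        return True


def validation_view(request):
    # Hypothetical view/template names, just to show the shape of the check.
    if study_table_locked():
        return render(request, "DataRepo/server_busy.html", status=503)
    ...  # otherwise, proceed with the normal validation/submission handling
```

Note that this probe is crude (it briefly takes row locks on every Study record itself), and per the uncertainty above it is untested whether it actually detects the locks taken by a load in progress.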
BUG DESCRIPTION
Problem
There appears to be an infrequent issue where form submissions on the "start a submission" page randomly hang and encounter a gateway timeout. THIS MAY ONLY HAPPEN DURING A LOAD ON THE COMMAND LINE OR DURING CONCURRENT VALIDATIONS.
Steps to reproduce
Concurrent load and validation:

1. Edit TraceBase/.env to set SQL_LOGGING=True
2. python manage.py runserver
3. python manage.py load_study --infile __a_large_study_doc_like_the_water_study__
4. While the load is running, submit a validation from the "start a submission" page in a browser.

Concurrent validation submissions:

1. Edit TraceBase/.env to set SQL_LOGGING=True
2. python manage.py runserver
3. Submit validations from 2 browser windows at roughly the same time.
Previous Notes:
I have not been able to make it happen on purpose, but it can happen with apparently any submission, regardless of size. A study doc submission that repeatedly takes perhaps 3 seconds can encounter the timeout. It might have something to do with concurrent user submissions or perhaps a browser cache issue (though I seriously doubt it), because it can happen repeatedly, and happened to succeed after restarting the browser.
Perhaps a resource contention or deadlock? The file that I was using when I encountered the timeout was: study_workbook-johns_errors1.xlsx
Current behavior
The Concurrent load and validation case above is the clearest example, because you can see the SQL query that hangs in the runserver window, and you can watch the progress of the load in the other terminal window. When the load finishes, you can observe that the validation process resumes immediately.
In the Concurrent validation submissions case, it's harder to see what is going on, because both jobs are outputting to the same console, but you can time the 2 browser windows and note that the second window takes twice as long as the first one.
Previous Notes:
The page hangs for maybe a minute and then you get a gateway timeout. The log files are too verbose right now (intentionally for (issue-specific) debugging on dev) to find these random occurrences, but I did obtain access to the web logs and could not find anything obvious with a brief glance (hours after the issue occurred) - needs a focussed look. Sven said that memory usage was completely fine, but there was some heavy cpu usage that could have caused the issue, though the timing didn't seem to correspond with the first occurrence (first occurrence was before 12:19pm July 29th 2024) and the spike in the logs was around 2:15pm (that could have corresponded with us encountering the issue while trying to debug it).
It's possible it could be data-specific (although I feel like that's a long shot, since the same file can either take 3 seconds or time out).
It's also possible it could be related to CAS, as the file being submitted was, at one point, selected off the ms data share, which required netid authentication to mount...
Expected behavior
Every validation job should finish without a timeout error, despite concurrent requests.
Suggested Change
There are 2 things to solve here:
There are a number of possible steps to take to deal with the concurrent validation issue:

- Celery: spawn the job in a separate process that reports progress in the form of a waiting page/progress bar (a rough sketch appears at the end of this section).

Previous Notes:
If this is due to concurrent submissions or submissions during a command line (dry-run) load, we may be able to queue submissions and/or temporarily disable validation during loads. We should probably also schedule loads to happen overnight. The issue can also be mitigated by speeding up the load scripts.
~We need to figure out how to reproduce the issue reliably. I was thinking of trying multiple concurrent submissions to see if that can cause it, but I haven't done it yet. Should also watch the log.~
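If we do eventually go the Celery route, a rough sketch of the shape it could take (nothing below exists in TraceBase yet; it assumes a standard Celery app and broker are already configured, and the task/function names are hypothetical):

```python
from celery import shared_task


@shared_task(bind=True)
def validate_submission_task(self, study_file_path):
    # Placeholder for the existing validation logic, run in a worker process
    # so the browser request never waits on the database locks.
    ...


def enqueue_validation(study_file_path):
    """Queue a validation job and return its Celery task id."""
    result = validate_submission_task.delay(study_file_path)
    # A waiting page could poll the task's state (PENDING/STARTED/SUCCESS) to
    # drive a progress bar; reporting "you are Nth in line" would take extra
    # bookkeeping (e.g. recording queued task ids in the database).
    return result.id
```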
Comment
None
ISSUE OWNER SECTION
Assumptions
Limitations
Affected Components
Requirements
DESIGN
GUI Change description
None provided
Code Change Description
None provided
Tests