I found in the end that it was not an issue with the meta file but with the args_cellpose_segmentation.json. The mistake that triggered the JSON decoder error was the following:
fr{
"level": 2,
"wavelength_id": "A01_C01",
"output_label_name": "nuclei",
"input_ROI_table": "FOV_ROI_table",
"model_type": "nuclei",
"diameter_level0": 120
}
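For reference, a stray prefix like the fr above makes the whole file invalid JSON, so parsing fails immediately; a minimal sketch of the resulting parser error (the file content is shortened to a hypothetical two-key snippet):

```python
import json

# A stray "fr" before the opening brace makes the whole string invalid JSON,
# so json.loads fails on the very first character.
bad_args = 'fr{"level": 2, "wavelength_id": "A01_C01"}'
try:
    json.loads(bad_args)
except json.JSONDecodeError as err:
    print(err)  # e.g. "Expecting value: line 1 column 1 (char 0)"
```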
I tried resubmitting the workflow with the erroneous args_cellpose_segmentation.json and I only got the json decoder error this time and the database was not locked. So it might have been a weird coincidence that it happened the first time.
Ok, thanks @adrtsc! Yes, these types of errors should cause a JSON decoding error, the task then simply isn't added to the workflow, and they shouldn't crash the server or block the database.
My suspicion is that something else than the JSONDecodeError happened, but I don't have access to the logs this far back atm and can't trace it.
Let's keep an eye on whether we find other issues that can lock the database and whether we can find the relevant logs on what happened in those cases.
Thanks for reporting; this obviously should not have happened.
General questions:
> it's another case of something that would get better with postgres.
That'd be my first guess, see e.g. https://docs.djangoproject.com/en/dev/ref/databases/#database-is-locked-errors. Let's quickly move towards using postgres in all instances.
Re: logs
We don't store them to file (that would be part of #620), so there's no obvious way to retrieve them. For the moment the easiest workaround would be to redirect stdout/stderr of the fractalctl start command to a local file.
Was the original error triggered from the command-line client or from the web client? In both cases, it's hard to imagine how a wrong JSON file would lead to this error. This is very clear e.g. for the command-line client: if it cannot read an args JSON file, it will fail and it will not make any call to fractal-server.
From the command line client. And I agree! It's just the only error Adrian observed.
Regarding logs: +1 for #620 then :) Once we have the logs and if we observe this kind of error again, let's dig into them.
For now, I don't think there's anything that needs to be done on this issue, let's see if we trigger it again using sqlite :)
I reproduced this issue (or a related one) as in the following. I don't have a clear picture yet, but I'm setting up a way to debug.
I installed fractal-server (as in fractal-demos/example/server, but with version 1.2.2 and also with the gunicorn extra), and I ran it in two ways (always with a sqlite db):
gunicorn fractal_server.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 127.0.0.1:$PORT --access-logfile - > fractal_server.log 2>&1
fractalctl start
Then (after setting up the user/tasks with the examples/00_user_setup/local_user_setup.sh script), I ran multiple simultaneous instances of 01_cardio_tiny_dataset/run_example.sh, in batches of size three (that is, I started three examples, then three others that would start right after the first ones, and so on).
I used the first test (the one with gunicorn and 4 workers) to make sure that multiple requests are reaching the db simultaneously. I checked this via fuser db/fractal_server.db ("fuser displays the PIDs of processes using the specified files or file systems"), which at some point showed output like:
$ fuser db/fractal_server.db
/some/path/fractal-demos/examples/server/db/fractal_server.db: 88860 88861 88863
where the three PIDs are part of the gunicorn ones, as per these logs
[2023-04-26 13:43:49 +0200] [88859] [INFO] Starting gunicorn 20.1.0
[2023-04-26 13:43:49 +0200] [88859] [INFO] Listening at: http://127.0.0.1:8010 (88859)
[2023-04-26 13:43:49 +0200] [88859] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2023-04-26 13:43:49 +0200] [88860] [INFO] Booting worker with pid: 88860
[2023-04-26 13:43:49 +0200] [88861] [INFO] Booting worker with pid: 88861
[2023-04-26 13:43:49 +0200] [88863] [INFO] Booting worker with pid: 88863
[2023-04-26 13:43:49 +0200] [88864] [INFO] Booting worker with pid: 88864
...
In this way, I know that I am triggering multiple simultaneous db operations.
Also, this suggests that sqlite itself has no problem handling this kind of workload (as expected, see https://www.sqlite.org/whentouse.html).
I then re-ran the same example, but running fractal-server with the usual fractalctl start, and in this case fuser only listed a single process (as expected). However, I soon hit the locked-database error; see the relevant part of the fractal-server logs below:
...
INFO: 127.0.0.1:51250 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51262 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51278 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51284 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:51300 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51314 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:51324 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:51338 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:51346 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:51324 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51338 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51346 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:43744 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:43760 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:43746 - "POST /auth/token/login HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1015, in _commit_impl
self.engine.dialect.do_commit(self.connection)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 685, in do_commit
dbapi_connection.commit()
sqlite3.OperationalError: database is locked
The above exception was the direct cause of the following exception:
...
And at this point the CLI client always hits a timeout error, as in the original comment of this issue.
This behavior is unexpected, I will debug it further.
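For reference, "database is locked" is sqlite's generic behavior when two connections compete for the write lock and one of them gives up after its busy timeout; a self-contained sketch of that failure mode (unrelated to the fractal-server code, using a hypothetical demo.db file):

```python
import sqlite3

# Two connections to the same file; very short busy timeouts so the failure
# shows up immediately instead of after the default 5-second wait.
conn_a = sqlite3.connect("demo.db", timeout=0.1, isolation_level=None)
conn_b = sqlite3.connect("demo.db", timeout=0.1, isolation_level=None)

conn_a.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
conn_a.execute("BEGIN IMMEDIATE")           # connection A takes the write lock...
conn_a.execute("INSERT INTO t VALUES (1)")  # ...and keeps its transaction open

try:
    # Connection B also needs the write lock; once its busy timeout expires it
    # fails with the same OperationalError seen in the server logs above.
    conn_b.execute("INSERT INTO t VALUES (2)")
except sqlite3.OperationalError as err:
    print(err)  # database is locked

conn_a.execute("COMMIT")
```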
To be checked:
I switched to a postgres db, and repeated the test (with batches of four examples).
It looks like the test runs smoothly, as visible in the logs up to this point (note the multiple apply calls):
INFO: 127.0.0.1:34258 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34274 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34286 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34302 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34310 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34320 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34322 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34332 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34340 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34354 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34356 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34360 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34372 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34380 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34384 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34396 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34410 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34422 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34434 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34440 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34454 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34466 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34468 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:17,606 - WF1_job1 - INFO - Start execution of workflow "Workflow cardiac-test-110"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000001/workflow.log
INFO: 127.0.0.1:45616 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45644 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45628 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45648 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45644 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:18,919 - WF2_job2 - INFO - Start execution of workflow "Workflow cardiac-test-120"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000002_wf_0000002_job_0000002/workflow.log
INFO: 127.0.0.1:45616 - "GET /api/v1/task/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45628 - "GET /api/v1/task/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45648 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45616 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45628 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45652 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45658 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45652 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45660 - "PATCH /api/v1/project/5/9 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45662 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:20,132 - WF3_job3 - INFO - Start execution of workflow "Workflow cardiac-test-130"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000003_wf_0000003_job_0000003/workflow.log
INFO: 127.0.0.1:45658 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:20,238 - WF4_job4 - INFO - Start execution of workflow "Workflow cardiac-test-140"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000004_wf_0000004_job_0000004/workflow.log
INFO: 127.0.0.1:45672 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45688 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45684 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45674 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45688 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45684 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45672 - "POST /api/v1/project/5/9 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45674 - "PATCH /api/v1/project/6/10 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45700 - "POST /api/v1/project/5/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45708 - "POST /api/v1/project/6/10 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45704 - "PATCH /api/v1/project/8/12 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45724 - "PATCH /api/v1/project/7/11 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45728 - "PATCH /api/v1/project/5/13 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45742 - "POST /api/v1/project/8/12 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45754 - "POST /api/v1/project/6/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45764 - "POST /api/v1/project/7/11 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45776 - "POST /api/v1/project/5/13 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45786 - "PATCH /api/v1/project/6/14 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45792 - "POST /api/v1/project/7/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45796 - "POST /api/v1/project/8/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45798 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45806 - "POST /api/v1/project/6/14 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45822 - "PATCH /api/v1/project/7/15 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45818 - "PATCH /api/v1/project/8/16 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45836 - "GET /api/v1/task/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45836 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45838 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55478 - "POST /api/v1/project/7/15 HTTP/1.1" 201 Created
INFO: 127.0.0.1:55492 - "POST /api/v1/project/8/16 HTTP/1.1" 201 Created
INFO: 127.0.0.1:55508 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55524 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55530 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55538 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55548 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55558 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55570 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55584 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55596 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55602 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55616 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55618 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55622 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55628 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55636 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55648 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55660 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55668 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55672 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55678 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55682 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55694 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55704 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:35,723 - WF5_job5 - INFO - Start execution of workflow "Workflow cardiac-test-111"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000005_wf_0000005_job_0000005/workflow.log
INFO: 127.0.0.1:55718 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55722 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55732 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55748 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:37,243 - WF7_job6 - INFO - Start execution of workflow "Workflow cardiac-test-131"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000007_wf_0000007_job_0000006/workflow.log
INFO: 127.0.0.1:47316 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:38,026 - WF6_job7 - INFO - Start execution of workflow "Workflow cardiac-test-121"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000006_wf_0000006_job_0000007/workflow.log
INFO: 127.0.0.1:47330 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
However, after a while we hit a different error ("QueuePool limit of size ...", see https://docs.sqlalchemy.org/en/14/errors.html#error-3o7r); relevant logs:
...
INFO: 127.0.0.1:55704 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:35,723 - WF5_job5 - INFO - Start execution of workflow "Workflow cardiac-test-111"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000005_wf_0000005_job_0000005/workflow.log
INFO: 127.0.0.1:55718 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55722 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55732 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55748 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:37,243 - WF7_job6 - INFO - Start execution of workflow "Workflow cardiac-test-131"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000007_wf_0000007_job_0000006/workflow.log
INFO: 127.0.0.1:47316 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:38,026 - WF6_job7 - INFO - Start execution of workflow "Workflow cardiac-test-121"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000006_wf_0000006_job_0000007/workflow.log
INFO: 127.0.0.1:47330 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/cors.py", line 84, in __call__
await self.app(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/responses.py", line 174, in __call__
await self.background()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/background.py", line 43, in __call__
await task()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/background.py", line 26, in __call__
await self.func(*self.args, **self.kwargs)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/fractal_server/app/runner/__init__.py", line 111, in submit_workflow
job: ApplyWorkflow = db_sync.get(ApplyWorkflow, job_id) # type: ignore
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2785, in get
return self._get_impl(
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2892, in _get_impl
return db_load_fn(
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/loading.py", line 530, in load_on_pk_identity
session.execute(
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1691, in execute
conn = self._connection_for_bind(bind)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1532, in _connection_for_bind
return self._transaction._connection_for_bind(
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind
conn = bind.connect()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/future/engine.py", line 406, in connect
return super(Engine, self).connect()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3234, in connect
return self._connection_cls(self, close_with_result=close_with_result)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
else engine.raw_connection()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3313, in raw_connection
return self._wrap_pool_connect(self.pool.connect, _connection)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3280, in _wrap_pool_connect
return fn()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 310, in connect
return _ConnectionFairy._checkout(self)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 476, in checkout
rec = pool._do_get()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 134, in _do_get
raise exc.TimeoutError(
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)
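For reference, the numbers in that message correspond to SQLAlchemy's default QueuePool settings: at most pool_size + max_overflow connections can be checked out at once, and a further checkout waits pool_timeout seconds before raising. A minimal sketch with those defaults spelled out (the URL is a placeholder, not the actual deployment configuration):

```python
from sqlalchemy import create_engine

# SQLAlchemy's pool defaults: 5 persistent connections plus up to 10 overflow
# connections; a 16th simultaneous checkout waits 30 s and then raises the
# TimeoutError quoted above. The engine is created lazily, so this runs as-is.
engine = create_engine(
    "postgresql+psycopg2://user:password@localhost/fractal_db",  # placeholder
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)
```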
Note that in my tests the same machine runs fractal-server and postgres but also the actual tasks (including some cellpose task, applied to tiny datasets). And this machine quickly gets overloaded, due to tasks. The error could simply be a consequence of extremely degraded performance.
I'll now modify the workflow definition, so that it skips the heaviest task(s), and see whether things change.
> And this machine quickly gets overloaded, due to tasks. The error could simply be a consequence of extremely degraded performance.
It turns out that my guess was wrong.
Here is what I did:
Note that having the N fractal apply commands one after the other (in the same fractal-demos script) is useful to rapidly create multiple running jobs.
I tried with N=20, and one of the job submissions (number 8, in this case) failed as before (QueuePool limit of size 5 overflow 10 reached, connection timed out). The other 19 went through.
From the CLI client, I see this as
$ fractal job list 1
Job list for project 1
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ id ┃ start_timestamp ┃ status ┃ workflow_id ┃ working_dir ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 8 │ 2023-04-26 13:32:29 │ submitted │ 1 │ │
│ 1 │ 2023-04-26 13:32:24 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000001 │
│ 2 │ 2023-04-26 13:32:24 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000002 │
│ 9 │ 2023-04-26 13:33:00 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000009 │
│ 3 │ 2023-04-26 13:32:25 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000003 │
│ 4 │ 2023-04-26 13:32:26 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000004 │
│ 5 │ 2023-04-26 13:32:26 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000005 │
│ 6 │ 2023-04-26 13:32:27 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000006 │
│ 7 │ 2023-04-26 13:32:28 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000007 │
│ 10 │ 2023-04-26 13:33:01 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000010 │
│ 11 │ 2023-04-26 13:33:02 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000011 │
│ 12 │ 2023-04-26 13:33:02 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000012 │
│ 13 │ 2023-04-26 13:33:03 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000013 │
│ 14 │ 2023-04-26 13:33:04 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000014 │
│ 15 │ 2023-04-26 13:33:04 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000015 │
└────┴─────────────────────┴───────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘
My guess remains that something is going wrong in the background process that executes the workflow.
Hmm, thanks for documenting this @tcompa! It's of course very important to dig deeper here; we'll likely see somewhat similar usage levels (though likely slightly staggered submissions) during the demo and as usage increases in the labs.
So far this argues for using gunicorn with multiple workers, and is potentially another argument in favor of postgres, but there are still some open questions regarding the sqlalchemy.exc.TimeoutError: QueuePool limit.
Are there some requests (e.g. the apply with having to use sudo etc.) that block the database for an extended period of time?
> we'll likely see somewhat similar usage levels (though likely slightly staggered submissions)
Yes, that's what makes this issue slightly less severe. If we add a one-second interval between subsequent apply calls, the issue is gone (that is, all 20 jobs are submitted). To be more explicit, this is the end of the script which goes through:
for ind in {1..20}; do
fractal --batch workflow apply -o $DS_OUT_ID -p $PRJ_ID $WF_ID $DS_IN_ID
sleep 1.0
done
while if we remove the sleep command (or make it much smaller, like 0.3 seconds), then at least one of the API calls fails.
Having 20 people clicking "submit" within a second is unlikely, but we still have to make it work.
I'll clean up the example, removing whatever is not needed, and make some more systematic tests with the different approaches. For the record, the current test passes with no problem when using postgres+gunicorn (without any need for sleep), so gunicorn is probably the right way to go. Still, let's make sure we don't have any db connection which is not needed.
> Are there some requests (e.g. the apply with having to use sudo etc.) that block the database for an extended period of time?
In principle there isn't any time-consuming request (and this is all with the local executor, so unrelated to sudo/slurm), but this is what I want to review.
> For the record, the current test passes with no problem when using postgres+gunicorn
Ok, still worthwhile to investigate this a bit further, but then that's another argument for using postgres + gunicorn for the actual deployments and the demos. Good to be aware of! :)
> Ok, still worthwhile to investigate this a bit further, but then that's another argument for using postgres + gunicorn
To be precise, sqlite+gunicorn shows no problem (at least not in the current tests), so postgres is not a hard requirement. And in general we are not even approaching the sqlite limits (their docs talk about 100k hits/day, or even ten times more).
However, especially for UZH, we should make sure to use the new machine (https://github.com/fractal-analytics-platform/fractal-server/issues/615#issuecomment-1517538507), because:
> SQLite will work over a network filesystem, but because of the latency associated with most network filesystems, performance will not be great. Also, file locking logic is buggy in many network filesystem implementations (on both Unix and Windows). If file locking does not work correctly, two or more clients might try to modify the same part of the same database at the same time, resulting in corruption. Because this problem results from bugs in the underlying filesystem implementation, there is nothing SQLite can do to prevent it (https://www.sqlite.org/whentouse.html)
For the record: more systematic tests are underway as part of https://github.com/fractal-analytics-platform/fractal-demos/pull/50/files. The current understanding suggests that postgres+gunicorn may mitigate the problem (up to a point where we don't expect our current scale to be an issue for the demos), but something is still unclear about the number of open DB connections.
Again, for the record (I'll do a proper PR later): add await db.close() in the workflow-apply endpoint, after await db.refresh(job).
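A hedged, self-contained sketch of the idea (all routes, models and helper names below are illustrative and use a plain sync Session, not the actual fractal-server code): the endpoint closes its DB session as soon as the job row is committed and refreshed, so the pooled connection is released before the slow background task starts.

```python
import time

from fastapi import BackgroundTasks, Depends, FastAPI
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, sessionmaker

engine = create_engine("sqlite:///demo_jobs.db")  # placeholder local DB
SessionLocal = sessionmaker(bind=engine)
Base = declarative_base()


class Job(Base):  # illustrative stand-in for the ApplyWorkflow model
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    status = Column(String, default="submitted")


Base.metadata.create_all(engine)
app = FastAPI()


def get_db():
    # Request-scoped session dependency (illustrative).
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


def run_job(job_id: int) -> None:
    # Long-running background work opens its own short-lived session.
    time.sleep(5)
    with SessionLocal() as db:
        db.get(Job, job_id).status = "done"
        db.commit()


@app.post("/apply/", status_code=202)
def apply_workflow(background_tasks: BackgroundTasks, db: Session = Depends(get_db)):
    job = Job()
    db.add(job)
    db.commit()
    db.refresh(job)
    job_id = job.id
    db.close()  # release the pooled connection before the background task runs
    background_tasks.add_task(run_job, job_id)
    return {"job_id": job_id}
```

In fractal-server the session is async, hence the await db.close() mentioned above.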
Discussion in this issue led to #651, and the current state of affairs is summarized in https://github.com/fractal-analytics-platform/fractal-server/issues/649#issuecomment-1527666450.
It's still possible to observe the original issue with 1.2.3 (if using sqlite), but it should be much harder to reproduce (e.g. it would require several calls to the apply-workflow endpoint to be made in a very short time interval).
Thank you so much @adrtsc, your testing exposed some serious bugs!
The database on the UZH deployment running server 1.2.2 became locked today. After Adrian ran into an error when providing a wrong JSON file while adding a task to a workflow (likely an error in the meta JSON file), the database became locked and any user of the fractal client would see this error:
The reported original error in the workflow was:
Though I'm a bit surprised by that. I've seen this error multiple times when adding wrong task-argument JSONs and it never froze the server.
Are server logs written to disk anywhere? The scroll-back in the screen only revealed this type of error over & over for different requests by users doing any operation (even just fractal user whoami), but not the original error. This is the error that gets triggered every time a user tries to interact with the server via the client:
Restarting the server solved the issue, so it seems some process was not shut down correctly after an exception.
tl;dr: Let's keep an eye on this. Not sure how we can reproduce the error and whether it's another case of something that would get better with postgres. We should try adding a task to a workflow with a corrupt meta.json file (cc @adrtsc do you still have the one that caused the error for you?)