I found in the end that it was not an issue with the meta file but with the args_cellpose_segmentation.json. The mistake that triggered the JSON decoder error was the following:
fr{
"level": 2,
"wavelength_id": "A01_C01",
"output_label_name": "nuclei",
"input_ROI_table": "FOV_ROI_table",
"model_type": "nuclei",
"diameter_level0": 120
}
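For reference, a stray prefix like the fr above makes the whole file invalid JSON, so parsing fails immediately; a minimal sketch of the resulting parser error (the file content is shortened to a hypothetical two-key snippet):

```python
import json

# A stray "fr" before the opening brace makes the whole string invalid JSON,
# so json.loads fails on the very first character.
bad_args = 'fr{"level": 2, "wavelength_id": "A01_C01"}'
try:
    json.loads(bad_args)
except json.JSONDecodeError as err:
    print(err)  # e.g. "Expecting value: line 1 column 1 (char 0)"
```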
I tried resubmitting the workflow with the erroneous args_cellpose_segmentation.json and I only got the json decoder error this time and the database was not locked. So it might have been a weird coincidence that it happened the first time.
Ok, thanks @adrtsc! Yes, these types of errors should cause a JSON decoding error, the task then simply isn't added to the workflow, and they shouldn't crash the server or block the database.
My suspicion is that something else than the JSONDecodeError happened, but I don't have access to the logs this far back atm and can't trace it.
Let's keep an eye on whether we find other issues that can lock the database and whether we can find the relevant logs on what happened in those cases.
Thanks for reporting; this obviously should not have happened.
General questions:
> it's another case of something that would get better with postgres.
That'd be my first guess, see e.g. https://docs.djangoproject.com/en/dev/ref/databases/#database-is-locked-errors. Let's quickly move towards using postgres in all instances.
Re: logs
We don't store them to file (that would be part of #620), so there's no obvious way to retrieve them. For the moment the easiest workaround would be to redirect stdout/stderr of the fractalctl start command to a local file.
Was the original error triggered from the command-line client or from the web client? In both cases, it's hard to imagine how a wrong JSON file would lead to this error. This is very clear e.g. for the command-line client: if it cannot read an args JSON file, it will fail and it will not make any call to fractal-server.
From the command line client. And I agree! It's just the only error Adrian observed.
Regarding logs: +1 for #620 then :) Once we have the logs and if we observe this kind of error again, let's dig into them.
For now, I don't think there's anything that needs to be done on this issue, let's see if we trigger it again using sqlite :)
I reproduced this issue (or a related one) as in the following. I don't have a clear picture yet, but I'm setting up a way to debug.
I installed fractal-server (as in fractal-demos/example/server, but with version 1.2.2 and also with the gunicorn extra), and I ran it in two ways (always with a sqlite db):
gunicorn fractal_server.main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 127.0.0.1:$PORT --access-logfile - > fractal_server.log 2>&1
fractalctl start
Then (after setting up the user/tasks with the examples/00_user_setup/local_user_setup.sh script), I ran multiple simultaneous instances of 01_cardio_tiny_dataset/run_example.sh, in batches of size three (that is, I started three examples, then three others that would start right after the first ones, and so on).
I used the first test (the one with gunicorn and 4 workers) to make sure that multiple requests are reaching the db simultaneously. I checked this via fuser db/fractal_server.db ("fuser displays the PIDs of processes using the specified files or file systems"), which at some point showed output like:
$ fuser db/fractal_server.db
/some/path/fractal-demos/examples/server/db/fractal_server.db: 88860 88861 88863
where the three PIDs are part of the gunicorn ones, as per these logs
[2023-04-26 13:43:49 +0200] [88859] [INFO] Starting gunicorn 20.1.0
[2023-04-26 13:43:49 +0200] [88859] [INFO] Listening at: http://127.0.0.1:8010 (88859)
[2023-04-26 13:43:49 +0200] [88859] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2023-04-26 13:43:49 +0200] [88860] [INFO] Booting worker with pid: 88860
[2023-04-26 13:43:49 +0200] [88861] [INFO] Booting worker with pid: 88861
[2023-04-26 13:43:49 +0200] [88863] [INFO] Booting worker with pid: 88863
[2023-04-26 13:43:49 +0200] [88864] [INFO] Booting worker with pid: 88864
...
In this way, I know that I am triggering multiple simultaneous db operations.
Also, this suggests that sqlite itself has no problem handling this kind of workload (as expected, see https://www.sqlite.org/whentouse.html).
I then re-ran the same example, but running fractal-server with the usual fractalctl start, and in this case fuser only listed a single process (as expected). However, I soon hit the locked-database error; see the relevant part of the fractal-server logs below:
...
INFO: 127.0.0.1:51250 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51262 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51278 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51284 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:51300 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51314 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:51324 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:51338 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:51346 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:51324 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51338 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:51346 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:43744 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:43760 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:43746 - "POST /auth/token/login HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1015, in _commit_impl
self.engine.dialect.do_commit(self.connection)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 685, in do_commit
dbapi_connection.commit()
sqlite3.OperationalError: database is locked
The above exception was the direct cause of the following exception:
...
And at this point the CLI client always hits a timeout error, as in the original comment of this issue.
This behavior is unexpected, I will debug it further.
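For reference, "database is locked" is sqlite's generic behavior when two connections compete for the write lock and one of them gives up after its busy timeout; a self-contained sketch of that failure mode (unrelated to the fractal-server code, using a hypothetical demo.db file):

```python
import sqlite3

# Two connections to the same file; very short busy timeouts so the failure
# shows up immediately instead of after the default 5-second wait.
conn_a = sqlite3.connect("demo.db", timeout=0.1, isolation_level=None)
conn_b = sqlite3.connect("demo.db", timeout=0.1, isolation_level=None)

conn_a.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
conn_a.execute("BEGIN IMMEDIATE")           # connection A takes the write lock...
conn_a.execute("INSERT INTO t VALUES (1)")  # ...and keeps its transaction open

try:
    # Connection B also needs the write lock; once its busy timeout expires it
    # fails with the same OperationalError seen in the server logs above.
    conn_b.execute("INSERT INTO t VALUES (2)")
except sqlite3.OperationalError as err:
    print(err)  # database is locked

conn_a.execute("COMMIT")
```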
To be checked:
I switched to a postgres db, and repeated the test (with batches of four examples).
It looks like the test runs smoothly, as visible in the logs up to this point (note the multiple apply calls):
INFO: 127.0.0.1:34258 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34274 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34286 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34302 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34310 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34320 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34322 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34332 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34340 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34354 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34356 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34360 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34372 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34380 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34384 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34396 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34410 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34422 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34434 - "POST /api/v1/workflow/1/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34440 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34454 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34466 - "POST /api/v1/workflow/2/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:34468 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:17,606 - WF1_job1 - INFO - Start execution of workflow "Workflow cardiac-test-110"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000001/workflow.log
INFO: 127.0.0.1:45616 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45644 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45628 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45648 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45644 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:18,919 - WF2_job2 - INFO - Start execution of workflow "Workflow cardiac-test-120"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000002_wf_0000002_job_0000002/workflow.log
INFO: 127.0.0.1:45616 - "GET /api/v1/task/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45628 - "GET /api/v1/task/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45648 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45616 - "POST /api/v1/workflow/4/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45628 - "POST /api/v1/workflow/3/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45652 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45658 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45652 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45660 - "PATCH /api/v1/project/5/9 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45662 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:20,132 - WF3_job3 - INFO - Start execution of workflow "Workflow cardiac-test-130"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000003_wf_0000003_job_0000003/workflow.log
INFO: 127.0.0.1:45658 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:20,238 - WF4_job4 - INFO - Start execution of workflow "Workflow cardiac-test-140"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000004_wf_0000004_job_0000004/workflow.log
INFO: 127.0.0.1:45672 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45688 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45684 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45674 - "POST /auth/token/login HTTP/1.1" 200 OK
INFO: 127.0.0.1:45688 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45684 - "POST /api/v1/project/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45672 - "POST /api/v1/project/5/9 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45674 - "PATCH /api/v1/project/6/10 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45700 - "POST /api/v1/project/5/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45708 - "POST /api/v1/project/6/10 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45704 - "PATCH /api/v1/project/8/12 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45724 - "PATCH /api/v1/project/7/11 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45728 - "PATCH /api/v1/project/5/13 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45742 - "POST /api/v1/project/8/12 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45754 - "POST /api/v1/project/6/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45764 - "POST /api/v1/project/7/11 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45776 - "POST /api/v1/project/5/13 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45786 - "PATCH /api/v1/project/6/14 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45792 - "POST /api/v1/project/7/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45796 - "POST /api/v1/project/8/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45798 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45806 - "POST /api/v1/project/6/14 HTTP/1.1" 201 Created
INFO: 127.0.0.1:45822 - "PATCH /api/v1/project/7/15 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45818 - "PATCH /api/v1/project/8/16 HTTP/1.1" 200 OK
INFO: 127.0.0.1:45836 - "GET /api/v1/task/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:45836 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:45838 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55478 - "POST /api/v1/project/7/15 HTTP/1.1" 201 Created
INFO: 127.0.0.1:55492 - "POST /api/v1/project/8/16 HTTP/1.1" 201 Created
INFO: 127.0.0.1:55508 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55524 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55530 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55538 - "POST /api/v1/workflow/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55548 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55558 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55570 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55584 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55596 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55602 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55616 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55618 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55622 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55628 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55636 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55648 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55660 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55668 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55672 - "POST /api/v1/workflow/5/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55678 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55682 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55694 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55704 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:35,723 - WF5_job5 - INFO - Start execution of workflow "Workflow cardiac-test-111"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000005_wf_0000005_job_0000005/workflow.log
INFO: 127.0.0.1:55718 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55722 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55732 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55748 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:37,243 - WF7_job6 - INFO - Start execution of workflow "Workflow cardiac-test-131"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000007_wf_0000007_job_0000006/workflow.log
INFO: 127.0.0.1:47316 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:38,026 - WF6_job7 - INFO - Start execution of workflow "Workflow cardiac-test-121"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000006_wf_0000006_job_0000007/workflow.log
INFO: 127.0.0.1:47330 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
However, after a while we hit a different error ("QueuePool limit of size ...", see https://docs.sqlalchemy.org/en/14/errors.html#error-3o7r); relevant logs:
...
INFO: 127.0.0.1:55704 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:35,723 - WF5_job5 - INFO - Start execution of workflow "Workflow cardiac-test-111"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000005_wf_0000005_job_0000005/workflow.log
INFO: 127.0.0.1:55718 - "POST /api/v1/workflow/7/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55722 - "POST /api/v1/workflow/6/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55732 - "POST /api/v1/workflow/8/add-task/ HTTP/1.1" 201 Created
INFO: 127.0.0.1:55748 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:37,243 - WF7_job6 - INFO - Start execution of workflow "Workflow cardiac-test-131"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000007_wf_0000007_job_0000006/workflow.log
INFO: 127.0.0.1:47316 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
2023-04-26 15:02:38,026 - WF6_job7 - INFO - Start execution of workflow "Workflow cardiac-test-121"; more logs at /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000006_wf_0000006_job_0000007/workflow.log
INFO: 127.0.0.1:47330 - "POST /api/v1/project/apply/ HTTP/1.1" 202 Accepted
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/cors.py", line 84, in __call__
await self.app(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/responses.py", line 174, in __call__
await self.background()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/background.py", line 43, in __call__
await task()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/starlette/background.py", line 26, in __call__
await self.func(*self.args, **self.kwargs)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/fractal_server/app/runner/__init__.py", line 111, in submit_workflow
job: ApplyWorkflow = db_sync.get(ApplyWorkflow, job_id) # type: ignore
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2785, in get
return self._get_impl(
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2892, in _get_impl
return db_load_fn(
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/loading.py", line 530, in load_on_pk_identity
session.execute(
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1691, in execute
conn = self._connection_for_bind(bind)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1532, in _connection_for_bind
return self._transaction._connection_for_bind(
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind
conn = bind.connect()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/future/engine.py", line 406, in connect
return super(Engine, self).connect()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3234, in connect
return self._connection_cls(self, close_with_result=close_with_result)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
else engine.raw_connection()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3313, in raw_connection
return self._wrap_pool_connect(self.pool.connect, _connection)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3280, in _wrap_pool_connect
return fn()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 310, in connect
return _ConnectionFairy._checkout(self)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 476, in checkout
rec = pool._do_get()
File "/home/tommaso/miniconda3/envs/fractal-server-1.2.2/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 134, in _do_get
raise exc.TimeoutError(
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)
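For reference, the numbers in that message correspond to SQLAlchemy's default QueuePool settings: at most pool_size + max_overflow connections can be checked out at once, and a further checkout waits pool_timeout seconds before raising. A minimal sketch with those defaults spelled out (the URL is a placeholder, not the actual deployment configuration):

```python
from sqlalchemy import create_engine

# SQLAlchemy's pool defaults: 5 persistent connections plus up to 10 overflow
# connections; a 16th simultaneous checkout waits 30 s and then raises the
# TimeoutError quoted above. The engine is created lazily, so this runs as-is.
engine = create_engine(
    "postgresql+psycopg2://user:password@localhost/fractal_db",  # placeholder
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)
```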
Note that in my tests the same machine runs fractal-server and postgres but also the actual tasks (including some cellpose task, applied to tiny datasets). And this machine quickly gets overloaded, due to tasks. The error could simply be a consequence of extremely degraded performance.
I'll now modify the workflow definition, so that it skips the heaviest task(s), and see whether things change.
> And this machine quickly gets overloaded, due to tasks. The error could simply be a consequence of extremely degraded performance.
It turns out that my guess was wrong.
Here is what I did:
Note that having the N fractal apply commands one after the other (in the same fractal-demos script) is useful to rapidly create multiple running jobs.
I tried with N=20, and one of the job submissions (number 8, in this case) failed as before (QueuePool limit of size 5 overflow 10 reached, connection timed out). The other 19 went through.
From the CLI client, I see this as
$ fractal job list 1
Job list for project 1
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ id ┃ start_timestamp ┃ status ┃ workflow_id ┃ working_dir ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 8 │ 2023-04-26 13:32:29 │ submitted │ 1 │ │
│ 1 │ 2023-04-26 13:32:24 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000001 │
│ 2 │ 2023-04-26 13:32:24 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000002 │
│ 9 │ 2023-04-26 13:33:00 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000009 │
│ 3 │ 2023-04-26 13:32:25 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000003 │
│ 4 │ 2023-04-26 13:32:26 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000004 │
│ 5 │ 2023-04-26 13:32:26 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000005 │
│ 6 │ 2023-04-26 13:32:27 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000006 │
│ 7 │ 2023-04-26 13:32:28 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000007 │
│ 10 │ 2023-04-26 13:33:01 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000010 │
│ 11 │ 2023-04-26 13:33:02 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000011 │
│ 12 │ 2023-04-26 13:33:02 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000012 │
│ 13 │ 2023-04-26 13:33:03 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000013 │
│ 14 │ 2023-04-26 13:33:04 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000014 │
│ 15 │ 2023-04-26 13:33:04 │ done │ 1 │ /home/tommaso/Fractal/fractal-demos/examples/server/artifacts/proj_0000001_wf_0000001_job_0000015 │
└────┴─────────────────────┴───────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘
My guess remains that something is going wrong in the background process that executes the workflow.
Hmm, thanks for documenting this @tcompa! It's of course very important to dig deeper here; we'll likely see somewhat similar usage levels (though likely slightly staggered submissions) during the demo and as usage increases in the labs.
So far this argues for using gunicorn with multiple workers, and is potentially another argument in favor of postgres, but there are still some open questions regarding the sqlalchemy.exc.TimeoutError: QueuePool limit.
Are there some requests (e.g. the apply with having to use sudo etc.) that block the database for an extended period of time?
> we'll likely see somewhat similar usage levels (though likely slightly staggered submissions)
Yes, that's what makes this issue slightly less severe. If we add a one-second interval between subsequent apply calls, the issue is gone (that is, all 20 jobs are submitted). To be more explicit, this is the end of the script which goes through:
for ind in {1..20}; do
fractal --batch workflow apply -o $DS_OUT_ID -p $PRJ_ID $WF_ID $DS_IN_ID
sleep 1.0
done
while if we remove the sleep command (or make it much smaller, like 0.3 seconds), then at least one of the API calls fails.
Having 20 people clicking "submit" within a second is unlikely, but we still have to make it work.
I'll clean up the example, removing whatever is not needed, and make some more systematic tests with the different approaches. For the record, the current test passes with no problem when using postgres+gunicorn (without any need for sleep), so gunicorn is probably the right way to go. Still, let's make sure we don't have any db connection which is not needed.
> Are there some requests (e.g. the apply with having to use sudo etc.) that block the database for an extended period of time?
In principle there isn't any time-consuming request (and this is all with the local executor, so unrelated to sudo/slurm), but this is what I want to review.
> For the record, the current test passes with no problem when using postgres+gunicorn
Ok, still worthwhile to investigate this a bit further, but then that's another argument for using postgres + gunicorn for the actual deployments and the demos. Good to be aware of! :)
> Ok, still worthwhile to investigate this a bit further, but then that's another argument for using postgres + gunicorn
To be precise, sqlite+gunicorn shows no problem (at least not in the current tests), so postgres is not a hard requirement. And in general we are not even approaching the sqlite limits (their docs talk about 100k hits/day, or even ten times more).
However, especially for UZH, we should make sure to use the new machine (https://github.com/fractal-analytics-platform/fractal-server/issues/615#issuecomment-1517538507), because:
> SQLite will work over a network filesystem, but because of the latency associated with most network filesystems, performance will not be great. Also, file locking logic is buggy in many network filesystem implementations (on both Unix and Windows). If file locking does not work correctly, two or more clients might try to modify the same part of the same database at the same time, resulting in corruption. Because this problem results from bugs in the underlying filesystem implementation, there is nothing SQLite can do to prevent it (https://www.sqlite.org/whentouse.html)
For the record: more systematic tests are underway as part of https://github.com/fractal-analytics-platform/fractal-demos/pull/50/files. The current understanding suggests that postgres+gunicorn may mitigate the problem (up to a point where we don't expect our current scale to be an issue for the demos), but something is still unclear about the number of open DB connections.
Again, for the record (I'll do a proper PR later): add await db.close() in the workflow-apply endpoint, after await db.refresh(job).
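A hedged, self-contained sketch of the idea (all routes, models and helper names below are illustrative and use a plain sync Session, not the actual fractal-server code): the endpoint closes its DB session as soon as the job row is committed and refreshed, so the pooled connection is released before the slow background task starts.

```python
import time

from fastapi import BackgroundTasks, Depends, FastAPI
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, sessionmaker

engine = create_engine("sqlite:///demo_jobs.db")  # placeholder local DB
SessionLocal = sessionmaker(bind=engine)
Base = declarative_base()


class Job(Base):  # illustrative stand-in for the ApplyWorkflow model
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    status = Column(String, default="submitted")


Base.metadata.create_all(engine)
app = FastAPI()


def get_db():
    # Request-scoped session dependency (illustrative).
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


def run_job(job_id: int) -> None:
    # Long-running background work opens its own short-lived session.
    time.sleep(5)
    with SessionLocal() as db:
        db.get(Job, job_id).status = "done"
        db.commit()


@app.post("/apply/", status_code=202)
def apply_workflow(background_tasks: BackgroundTasks, db: Session = Depends(get_db)):
    job = Job()
    db.add(job)
    db.commit()
    db.refresh(job)
    job_id = job.id
    db.close()  # release the pooled connection before the background task runs
    background_tasks.add_task(run_job, job_id)
    return {"job_id": job_id}
```

In fractal-server the session is async, hence the await db.close() mentioned above.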
Discussion in this issue led to #651, and the current state of affairs is summarized in https://github.com/fractal-analytics-platform/fractal-server/issues/649#issuecomment-1527666450.
It's still possible to observe the original issue with 1.2.3 (if using sqlite), but it should be much harder to reproduce (e.g. it would require several calls to the apply-workflow endpoint to be made in a very short time interval).
Thank you so much @adrtsc, your testing exposed some serious bugs!
The database on the UZH deployment running server 1.2.2 became locked today. After Adrian ran into an error when providing a wrong JSON file while adding a task to a workflow (likely an error in the meta JSON file), the database became locked and any user of the fractal client would see this error:
The reported original error in the workflow was:
Though I'm a bit surprised by that. I've seen this error multiple times when adding wrong task-argument JSONs and it never froze the server.
Are server logs written to disk anywhere? The scroll-back in the screen only revealed this type of error over & over for different requests by users doing any operation (even just fractal user whoami), but not the original error. This is the error that gets triggered every time a user tries to interact with the server via the client:
Restarting the server solved the issue, so it seems some process was not shut down correctly after an exception.
tl;dr: Let's keep an eye on this. Not sure how we can reproduce the error and whether it's another case of something that would get better with postgres. We should try adding a task to a workflow with a corrupt meta.json file (cc @adrtsc do you still have the one that caused the error for you?)