eudyptula opened 1 year ago
I managed to locate more information about what is going on inside the hanging Docker container (the python -m prefect.engine
process):
$ sudo -E env "PATH=$PATH" py-spy dump --pid 760906
Process 760906: python -m prefect.engine
Python v3.10.8 (/usr/local/bin/python3.10)
Thread 0x7F5BFED67740 (idle): "MainThread"
select (selectors.py:469)
_run_once (asyncio/base_events.py:1863)
run_forever (asyncio/base_events.py:603)
run_until_complete (asyncio/base_events.py:636)
_cancel_all_tasks (asyncio/runners.py:63)
run (asyncio/runners.py:47)
run (anyio/_backends/_asyncio.py:292)
run (anyio/_core/_eventloop.py:73)
enter_flow_run_engine_from_subprocess (prefect/engine.py:174)
<module> (prefect/engine.py:1602)
_run_code (runpy.py:86)
_run_module_as_main (runpy.py:196)
Thread 0x7F5BF8C57700 (active): "asyncio_0"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5BF3FFF700 (idle): "AnyIO worker thread"
wait (threading.py:320)
result (concurrent/futures/_base.py:453)
run_async_from_thread (anyio/_backends/_asyncio.py:970)
run (anyio/from_thread.py:49)
run_async_from_worker_thread (prefect/utilities/asyncutils.py:148)
enter_flow_run_engine_from_flow_call (prefect/engine.py:154)
__call__ (prefect/flows.py:439)
unit_heat_curve_forecast_flow (**our flow file**)
capture_worker_thread_and_result (prefect/utilities/asyncutils.py:108)
run (anyio/_backends/_asyncio.py:867)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5BD8AA1700 (idle): "orion-log-worker"
wait (threading.py:324)
wait (threading.py:607)
_send_logs_loop (prefect/logging/handlers.py:79)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5B707F0700 (active): "asyncio_1"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5B6FFEF700 (active): "asyncio_2"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5B6EFED700 (active): "asyncio_3"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5B6F7EE700 (active): "asyncio_4"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Still an issue as of 2.10.4; as far as I can see, it's related to the task runner not shutting down properly when everything crashes.
Stack trace from the hanging Docker container:
$ sudo -E env "PATH=$PATH" py-spy dump --pid 17111
Process 17111: python -m prefect.engine
Python v3.10.11 (/usr/local/bin/python3.10)
Thread 0x7F671748F740 (idle): "MainThread"
wait (threading.py:320)
get (queue.py:171)
_handle_waiting_callbacks (prefect/_internal/concurrency/waiters.py:88)
wait (prefect/_internal/concurrency/waiters.py:124)
wait_for_call_in_loop_thread (prefect/_internal/concurrency/api.py:136)
enter_task_run_engine (prefect/engine.py:972)
__call__ (prefect/tasks.py:485)
flow (prefect/flows/projects/domos_single_zone_controller.py:105)
_run_sync (prefect/_internal/concurrency/calls.py:194)
run (prefect/_internal/concurrency/calls.py:139)
_handle_waiting_callbacks (prefect/_internal/concurrency/waiters.py:96)
wait (prefect/_internal/concurrency/waiters.py:124)
wait_for_call_in_loop_thread (prefect/_internal/concurrency/api.py:136)
enter_flow_run_engine_from_subprocess (prefect/engine.py:202)
<module> (prefect/engine.py:2159)
_run_code (runpy.py:86)
_run_module_as_main (runpy.py:196)
Thread 0x7F671118B700 (idle): "GlobalEventLoopThread"
select (selectors.py:469)
_run_once (asyncio/base_events.py:1871)
run_forever (asyncio/base_events.py:603)
run_until_complete (asyncio/base_events.py:636)
run (asyncio/runners.py:44)
_entrypoint (prefect/_internal/concurrency/threads.py:190)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F671098A700 (active): "asyncio_0"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F670BFFF700 (idle): "AnyIO worker thread"
wait (threading.py:320)
get (queue.py:171)
run (anyio/_backends/_asyncio.py:857)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F669AF07700 (active): "asyncio_1"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F6698FF3700 (active): "asyncio_2"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F6693FFF700 (active): "asyncio_3"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
@eudyptula do you have an MRE so I can debug this?
@madkinsz Unfortunately not on hand, but my best guess is that you'll need the following (see the sketch after this list):
- A flow that runs inside a Docker container.
- The flow needs to create many tasks with map; the goal is to overload the DB and trigger a TimeoutError() on the Prefect server, causing it to respond with a 500 Internal Server Error. Maybe lower the timeout settings on the server to make it easier.
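Something like this is the shape I have in mind; a minimal, untested sketch, where the `noop` task and `repro_flow` names are made up and the ~1000 mapped tasks roughly match our failing flows:

```python
from prefect import flow, task


@task
def noop(n: int) -> int:
    # The task body is trivial on purpose; the load comes from creating
    # ~1000 task runs in quick succession, not from the work itself.
    return n


@flow
def repro_flow():
    # Mapping over a large iterable floods the server with task-run
    # creation calls, which seems to be what triggers the DB timeout
    # and the resulting 500 responses.
    noop.map(list(range(1000)))


if __name__ == "__main__":
    repro_flow()
```

Running that inside a container against a server with lowered timeout settings should make the 500s much easier to hit.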
The timeouts themselves are covered by https://github.com/PrefectHQ/prefect/issues/9323, but they seem to trigger this issue. The last one I got looked like the following in the agent logs; it managed to create 274 of about 1000 task runs before failing.
Apr 26 08:26:24 prefect-next sh[146907]: 08:26:23.128 | INFO | Flow run 'lambda5-landris' - Created task run 'Determine sun factors for location-1' for task 'Determine sun factors for location'
Apr 26 08:26:24 prefect-next sh[146907]: 08:26:23.130 | INFO | Flow run 'lambda5-landris' - Submitted task run 'Determine sun factors for location-1' for execution.
Apr 26 08:26:24 prefect-next sh[146907]: 08:26:23.135 | INFO | Flow run 'lambda5-landris' - Created task run 'Determine sun factors for location-2' for task 'Determine sun factors for location'
Apr 26 08:26:24 prefect-next sh[146907]: 08:26:23.137 | INFO | Flow run 'lambda5-landris' - Submitted task run 'Determine sun factors for location-2' for execution.
[...]
Apr 26 08:26:54 prefect-next sh[146907]: 08:26:53.947 | INFO | Flow run 'lambda5-landris' - Created task run 'Determine sun factors for location-N' for task 'Determine sun factors for location'
Apr 26 08:26:54 prefect-next sh[146907]: 08:26:53.948 | INFO | Flow run 'lambda5-landris' - Submitted task run 'Determine sun factors for location-N' for execution.
Apr 26 08:26:54 prefect-next sh[146907]: 08:26:53.953 | ERROR | Task run 'Determine sun factors for location-23' - Crash detected! Execution was interrupted by an unexpected exception: PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'http://.../api/flow_runs/62ee921d-c617-4abe-941e-860d3a95c507'
Which API call they fail on doesn't seem relevant; I also had them failing on /api/task_runs
, for example.
The 500 errors seem to consistently leave the containers hanging for us.
I might try the other task runners later to see if they make a difference...
DaskTaskRunner generally seems to perform much better than ConcurrentTaskRunner, but 500 Internal Server Errors still cause hanging containers. I just switched to task_runner=DaskTaskRunner(), with all default settings and no specific Dask setup at all.
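For reference, the change is literally just the task_runner argument on the flow decorator; a minimal sketch, assuming the prefect-dask collection is installed:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner  # from the prefect-dask collection


@task
def noop(n: int) -> int:
    return n  # same trivial task as in the earlier sketch


@flow(task_runner=DaskTaskRunner())  # no arguments: a temporary local Dask cluster
def repro_flow_dask():
    # Same mapped workload as before; only the task runner changed.
    noop.map(list(range(1000)))
```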
A dump from the hanging container with Dask:
Process 169006: python -m prefect.engine
Python v3.10.11 (/usr/local/bin/python3.10)
Thread 0x7FC6E578B740 (idle): "MainThread"
wait (threading.py:320)
get (queue.py:171)
_handle_waiting_callbacks (prefect/_internal/concurrency/waiters.py:88)
wait (prefect/_internal/concurrency/waiters.py:124)
wait_for_call_in_loop_thread (prefect/_internal/concurrency/api.py:136)
enter_task_run_engine (prefect/engine.py:972)
__call__ (prefect/tasks.py:485)
flow (prefect/flows/control/temperature_forecast_controller.py:113)
_run_sync (prefect/_internal/concurrency/calls.py:194)
run (prefect/_internal/concurrency/calls.py:139)
_handle_waiting_callbacks (prefect/_internal/concurrency/waiters.py:96)
wait (prefect/_internal/concurrency/waiters.py:124)
wait_for_call_in_loop_thread (prefect/_internal/concurrency/api.py:136)
enter_flow_run_engine_from_subprocess (prefect/engine.py:202)
<module> (prefect/engine.py:2159)
_run_code (runpy.py:86)
_run_module_as_main (runpy.py:196)
Thread 0x7FC6C0184700 (idle): "GlobalEventLoopThread"
select (selectors.py:469)
_run_once (asyncio/base_events.py:1871)
run_forever (asyncio/base_events.py:603)
run_until_complete (asyncio/base_events.py:636)
run (asyncio/runners.py:44)
_entrypoint (prefect/_internal/concurrency/threads.py:190)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6BF983700 (active): "asyncio_0"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6BF14C700 (idle): "AnyIO worker thread"
wait (threading.py:320)
get (queue.py:171)
run (anyio/_backends/_asyncio.py:857)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6698DA700 (active): "Profile"
_watch (distributed/profile.py:349)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6690D4700 (idle): "AsyncProcess Dask Worker process (from Nanny) watch message queue"
wait (threading.py:320)
get (queue.py:171)
_watch_message_queue (distributed/process.py:230)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6688D2700 (idle): "AsyncProcess Dask Worker process (from Nanny) watch message queue"
wait (threading.py:320)
get (queue.py:171)
_watch_message_queue (distributed/process.py:230)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC65BFFF700 (active): "AsyncProcess Dask Worker process (from Nanny) watch process join"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
_watch_process (distributed/process.py:250)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC65B7FE700 (active): "AsyncProcess Dask Worker process (from Nanny) watch process join"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
_watch_process (distributed/process.py:250)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC65AFFD700 (idle): "AsyncProcess Dask Worker process (from Nanny) watch message queue"
wait (threading.py:320)
get (queue.py:171)
_watch_message_queue (distributed/process.py:230)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC65A7FC700 (idle): "AsyncProcess Dask Worker process (from Nanny) watch message queue"
wait (threading.py:320)
get (queue.py:171)
_watch_message_queue (distributed/process.py:230)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC659FFB700 (active): "AsyncProcess Dask Worker process (from Nanny) watch process join"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
_watch_process (distributed/process.py:250)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6597FA700 (active): "AsyncProcess Dask Worker process (from Nanny) watch process join"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
_watch_process (distributed/process.py:250)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC658FF9700 (active): "asyncio_1"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC63AD2A700 (active): "asyncio_2"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
After today's testing, the solution seems to be to increase the default timeout values, add retries on HTTP 500 to the clients, and switch to the DaskTaskRunner.
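For anyone else landing here, a sketch of the settings half of that, assuming Prefect 2.x; the values are arbitrary, and whether PREFECT_CLIENT_RETRY_EXTRA_CODES is available depends on your release:

```python
# In a container deployment these would normally be environment variables on
# the flow-run infrastructure; temporary_settings is used here only to keep
# the example self-contained. Values are illustrative, not recommendations.
from prefect.settings import (
    PREFECT_API_REQUEST_TIMEOUT,
    PREFECT_CLIENT_RETRY_EXTRA_CODES,  # assumption: present in recent 2.x releases
    temporary_settings,
)

with temporary_settings(
    updates={
        PREFECT_API_REQUEST_TIMEOUT: 120,         # raise the client HTTP timeout (seconds)
        PREFECT_CLIENT_RETRY_EXTRA_CODES: "500",  # also retry requests answered with a 500
    }
):
    ...  # run the flow here, e.g. the hypothetical repro_flow_dask() from above
```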
Just a quick update: DaskTaskRunner left a container running at 100% CPU usage, which ended up causing several issues across VMs that were sharing the same host, including our Prefect production setup.
Reverted all the way back to SequentialTaskRunner... hopefully that is more stable than the other two.
This looks like a duplicate of https://github.com/PrefectHQ/prefect/issues/9229 - the forked subprocesses here are likely the cause.
First check
Bug summary
On the agent we have long-running containers that are never closed (seen 40+ hours):
Simple flows (few tasks, no subflows, etc.) seem to run fine, but our more advanced flows (starting many tasks with map, etc.) are consistently crashing. The containers sometimes stop correctly when a flow crashes, and sometimes they don't. We will be looking into the flows and whether we made some errors there, but either way a container should be stopped when a flow crashes.
The container and server logs indicate that an HTTP 500 is caused by a database timeout, so I will try to increase PREFECT_ORION_DATABASE_TIMEOUT. Also, notice the successful calls before and after in the server logs.
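A minimal note on where that setting lives, sketched under the assumption of a Prefect 2.x server (the value is an arbitrary example):

```python
# PREFECT_ORION_DATABASE_TIMEOUT is read by the server process, so it has to
# be set in the environment the Prefect server runs in (container spec,
# service unit, or the shell that runs `prefect orion start`), not in the
# flow code. The value below is an arbitrary example, not a recommendation.
import os

os.environ.setdefault("PREFECT_ORION_DATABASE_TIMEOUT", "30")  # seconds
```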
We will try to update Prefect in the near future as well, but we're also occasionally experiencing the network issue (https://github.com/PrefectHQ/prefect/issues/7512), so we're doing a little trial and error with versions at the moment.
Logs from docker
Logs from the server
Reproduction
Error
No response
Versions
Additional context
No response