eudyptula opened 1 year ago
I managed to locate more information about what is going on inside the hanging Docker container (the python -m prefect.engine
process):
$ sudo -E env "PATH=$PATH" py-spy dump --pid 760906
Process 760906: python -m prefect.engine
Python v3.10.8 (/usr/local/bin/python3.10)
Thread 0x7F5BFED67740 (idle): "MainThread"
select (selectors.py:469)
_run_once (asyncio/base_events.py:1863)
run_forever (asyncio/base_events.py:603)
run_until_complete (asyncio/base_events.py:636)
_cancel_all_tasks (asyncio/runners.py:63)
run (asyncio/runners.py:47)
run (anyio/_backends/_asyncio.py:292)
run (anyio/_core/_eventloop.py:73)
enter_flow_run_engine_from_subprocess (prefect/engine.py:174)
<module> (prefect/engine.py:1602)
_run_code (runpy.py:86)
_run_module_as_main (runpy.py:196)
Thread 0x7F5BF8C57700 (active): "asyncio_0"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5BF3FFF700 (idle): "AnyIO worker thread"
wait (threading.py:320)
result (concurrent/futures/_base.py:453)
run_async_from_thread (anyio/_backends/_asyncio.py:970)
run (anyio/from_thread.py:49)
run_async_from_worker_thread (prefect/utilities/asyncutils.py:148)
enter_flow_run_engine_from_flow_call (prefect/engine.py:154)
__call__ (prefect/flows.py:439)
unit_heat_curve_forecast_flow (**our flow file**)
capture_worker_thread_and_result (prefect/utilities/asyncutils.py:108)
run (anyio/_backends/_asyncio.py:867)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5BD8AA1700 (idle): "orion-log-worker"
wait (threading.py:324)
wait (threading.py:607)
_send_logs_loop (prefect/logging/handlers.py:79)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5B707F0700 (active): "asyncio_1"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5B6FFEF700 (active): "asyncio_2"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5B6EFED700 (active): "asyncio_3"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F5B6F7EE700 (active): "asyncio_4"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Still an issue as of 2.10.4; as far as I can see, it's related to the task runner not shutting down properly when everything crashes.
Stack trace from the hanging Docker container:
$ sudo -E env "PATH=$PATH" py-spy dump --pid 17111
Process 17111: python -m prefect.engine
Python v3.10.11 (/usr/local/bin/python3.10)
Thread 0x7F671748F740 (idle): "MainThread"
wait (threading.py:320)
get (queue.py:171)
_handle_waiting_callbacks (prefect/_internal/concurrency/waiters.py:88)
wait (prefect/_internal/concurrency/waiters.py:124)
wait_for_call_in_loop_thread (prefect/_internal/concurrency/api.py:136)
enter_task_run_engine (prefect/engine.py:972)
__call__ (prefect/tasks.py:485)
flow (prefect/flows/projects/domos_single_zone_controller.py:105)
_run_sync (prefect/_internal/concurrency/calls.py:194)
run (prefect/_internal/concurrency/calls.py:139)
_handle_waiting_callbacks (prefect/_internal/concurrency/waiters.py:96)
wait (prefect/_internal/concurrency/waiters.py:124)
wait_for_call_in_loop_thread (prefect/_internal/concurrency/api.py:136)
enter_flow_run_engine_from_subprocess (prefect/engine.py:202)
<module> (prefect/engine.py:2159)
_run_code (runpy.py:86)
_run_module_as_main (runpy.py:196)
Thread 0x7F671118B700 (idle): "GlobalEventLoopThread"
select (selectors.py:469)
_run_once (asyncio/base_events.py:1871)
run_forever (asyncio/base_events.py:603)
run_until_complete (asyncio/base_events.py:636)
run (asyncio/runners.py:44)
_entrypoint (prefect/_internal/concurrency/threads.py:190)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F671098A700 (active): "asyncio_0"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F670BFFF700 (idle): "AnyIO worker thread"
wait (threading.py:320)
get (queue.py:171)
run (anyio/_backends/_asyncio.py:857)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F669AF07700 (active): "asyncio_1"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F6698FF3700 (active): "asyncio_2"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7F6693FFF700 (active): "asyncio_3"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
@eudyptula do you have an MRE so I can debug this?
@madkinsz Unfortunately not on hand, but my best guess is that you'll need the following (see the sketch after this list):
- A flow that runs inside a Docker container.
- The flow needs to create many tasks with map; the goal is to overload the DB and trigger a TimeoutError() on the Prefect server, causing it to respond with a 500 Internal Server Error. Maybe lower the timeout settings on the server to make it easier.
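Something like this is the shape I have in mind; a minimal, untested sketch, where the `noop` task and `repro_flow` names are made up and the ~1000 mapped tasks roughly match our failing flows:

```python
from prefect import flow, task


@task
def noop(n: int) -> int:
    # The task body is trivial on purpose; the load comes from creating
    # ~1000 task runs in quick succession, not from the work itself.
    return n


@flow
def repro_flow():
    # Mapping over a large iterable floods the server with task-run
    # creation calls, which seems to be what triggers the DB timeout
    # and the resulting 500 responses.
    noop.map(list(range(1000)))


if __name__ == "__main__":
    repro_flow()
```

Running that inside a container against a server with lowered timeout settings should make the 500s much easier to hit.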
The timeouts themselves are covered by https://github.com/PrefectHQ/prefect/issues/9323, but they seem to trigger this issue. The last one I got looked like the following in the agent logs; it managed to create 274 of about 1000 task runs before failing.
Apr 26 08:26:24 prefect-next sh[146907]: 08:26:23.128 | INFO | Flow run 'lambda5-landris' - Created task run 'Determine sun factors for location-1' for task 'Determine sun factors for location'
Apr 26 08:26:24 prefect-next sh[146907]: 08:26:23.130 | INFO | Flow run 'lambda5-landris' - Submitted task run 'Determine sun factors for location-1' for execution.
Apr 26 08:26:24 prefect-next sh[146907]: 08:26:23.135 | INFO | Flow run 'lambda5-landris' - Created task run 'Determine sun factors for location-2' for task 'Determine sun factors for location'
Apr 26 08:26:24 prefect-next sh[146907]: 08:26:23.137 | INFO | Flow run 'lambda5-landris' - Submitted task run 'Determine sun factors for location-2' for execution.
[...]
Apr 26 08:26:54 prefect-next sh[146907]: 08:26:53.947 | INFO | Flow run 'lambda5-landris' - Created task run 'Determine sun factors for location-N' for task 'Determine sun factors for location'
Apr 26 08:26:54 prefect-next sh[146907]: 08:26:53.948 | INFO | Flow run 'lambda5-landris' - Submitted task run 'Determine sun factors for location-N' for execution.
Apr 26 08:26:54 prefect-next sh[146907]: 08:26:53.953 | ERROR | Task run 'Determine sun factors for location-23' - Crash detected! Execution was interrupted by an unexpected exception: PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'http://.../api/flow_runs/62ee921d-c617-4abe-941e-860d3a95c507'
Which API call they fail on doesn't seem relevant; I also had them failing on /api/task_runs
, for example.
The 500 errors seem to consistently leave the containers hanging for us.
I might try the other task runners later to see if they make a difference...
DaskTaskRunner generally seems to perform much better than ConcurrentTaskRunner, but 500 Internal Server Errors still cause hanging containers. I just switched to task_runner=DaskTaskRunner(), with all default settings and no specific Dask setup at all.
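For reference, the change is literally just the task_runner argument on the flow decorator; a minimal sketch, assuming the prefect-dask collection is installed:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner  # from the prefect-dask collection


@task
def noop(n: int) -> int:
    return n  # same trivial task as in the earlier sketch


@flow(task_runner=DaskTaskRunner())  # no arguments: a temporary local Dask cluster
def repro_flow_dask():
    # Same mapped workload as before; only the task runner changed.
    noop.map(list(range(1000)))
```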
A dump from the hanging container with Dask:
Process 169006: python -m prefect.engine
Python v3.10.11 (/usr/local/bin/python3.10)
Thread 0x7FC6E578B740 (idle): "MainThread"
wait (threading.py:320)
get (queue.py:171)
_handle_waiting_callbacks (prefect/_internal/concurrency/waiters.py:88)
wait (prefect/_internal/concurrency/waiters.py:124)
wait_for_call_in_loop_thread (prefect/_internal/concurrency/api.py:136)
enter_task_run_engine (prefect/engine.py:972)
__call__ (prefect/tasks.py:485)
flow (prefect/flows/control/temperature_forecast_controller.py:113)
_run_sync (prefect/_internal/concurrency/calls.py:194)
run (prefect/_internal/concurrency/calls.py:139)
_handle_waiting_callbacks (prefect/_internal/concurrency/waiters.py:96)
wait (prefect/_internal/concurrency/waiters.py:124)
wait_for_call_in_loop_thread (prefect/_internal/concurrency/api.py:136)
enter_flow_run_engine_from_subprocess (prefect/engine.py:202)
<module> (prefect/engine.py:2159)
_run_code (runpy.py:86)
_run_module_as_main (runpy.py:196)
Thread 0x7FC6C0184700 (idle): "GlobalEventLoopThread"
select (selectors.py:469)
_run_once (asyncio/base_events.py:1871)
run_forever (asyncio/base_events.py:603)
run_until_complete (asyncio/base_events.py:636)
run (asyncio/runners.py:44)
_entrypoint (prefect/_internal/concurrency/threads.py:190)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6BF983700 (active): "asyncio_0"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6BF14C700 (idle): "AnyIO worker thread"
wait (threading.py:320)
get (queue.py:171)
run (anyio/_backends/_asyncio.py:857)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6698DA700 (active): "Profile"
_watch (distributed/profile.py:349)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6690D4700 (idle): "AsyncProcess Dask Worker process (from Nanny) watch message queue"
wait (threading.py:320)
get (queue.py:171)
_watch_message_queue (distributed/process.py:230)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6688D2700 (idle): "AsyncProcess Dask Worker process (from Nanny) watch message queue"
wait (threading.py:320)
get (queue.py:171)
_watch_message_queue (distributed/process.py:230)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC65BFFF700 (active): "AsyncProcess Dask Worker process (from Nanny) watch process join"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
_watch_process (distributed/process.py:250)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC65B7FE700 (active): "AsyncProcess Dask Worker process (from Nanny) watch process join"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
_watch_process (distributed/process.py:250)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC65AFFD700 (idle): "AsyncProcess Dask Worker process (from Nanny) watch message queue"
wait (threading.py:320)
get (queue.py:171)
_watch_message_queue (distributed/process.py:230)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC65A7FC700 (idle): "AsyncProcess Dask Worker process (from Nanny) watch message queue"
wait (threading.py:320)
get (queue.py:171)
_watch_message_queue (distributed/process.py:230)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC659FFB700 (active): "AsyncProcess Dask Worker process (from Nanny) watch process join"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
_watch_process (distributed/process.py:250)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC6597FA700 (active): "AsyncProcess Dask Worker process (from Nanny) watch process join"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
_watch_process (distributed/process.py:250)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC658FF9700 (active): "asyncio_1"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Thread 0x7FC63AD2A700 (active): "asyncio_2"
_worker (concurrent/futures/thread.py:81)
run (threading.py:953)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
After today's testing, the solution seems to be to increase the default timeout values, add retries on HTTP 500 to the clients, and switch to the DaskTaskRunner.
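For anyone else landing here, a sketch of the settings half of that, assuming Prefect 2.x; the values are arbitrary, and whether PREFECT_CLIENT_RETRY_EXTRA_CODES is available depends on your release:

```python
# In a container deployment these would normally be environment variables on
# the flow-run infrastructure; temporary_settings is used here only to keep
# the example self-contained. Values are illustrative, not recommendations.
from prefect.settings import (
    PREFECT_API_REQUEST_TIMEOUT,
    PREFECT_CLIENT_RETRY_EXTRA_CODES,  # assumption: present in recent 2.x releases
    temporary_settings,
)

with temporary_settings(
    updates={
        PREFECT_API_REQUEST_TIMEOUT: 120,         # raise the client HTTP timeout (seconds)
        PREFECT_CLIENT_RETRY_EXTRA_CODES: "500",  # also retry requests answered with a 500
    }
):
    ...  # run the flow here, e.g. the hypothetical repro_flow_dask() from above
```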
Just a quick update: DaskTaskRunner left a container running at 100% CPU usage, which ended up causing several issues across VMs that were sharing the same host, including our Prefect production setup.
Reverted all the way back to SequentialTaskRunner... hopefully that is more stable than the other two.
This looks like a duplicate of https://github.com/PrefectHQ/prefect/issues/9229 - the forked subprocesses here are likely the cause.
First check
Bug summary
On the agent we have long-running containers that are never closed (seen 40+ hours):
Simple flows (few tasks, no subflows, etc.) seem to run fine, but our more advanced flows (starting many tasks with map, etc.) are consistently crashing. The containers sometimes stop correctly when a flow crashes, and sometimes they don't. We will be looking into the flows and whether we made some errors there, but either way a container should be stopped when a flow crashes.
The container and server logs indicate that an HTTP 500 is caused by a database timeout, so I will try to increase PREFECT_ORION_DATABASE_TIMEOUT. Also, notice the successful calls before and after in the server logs.
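A minimal note on where that setting lives, sketched under the assumption of a Prefect 2.x server (the value is an arbitrary example):

```python
# PREFECT_ORION_DATABASE_TIMEOUT is read by the server process, so it has to
# be set in the environment the Prefect server runs in (container spec,
# service unit, or the shell that runs `prefect orion start`), not in the
# flow code. The value below is an arbitrary example, not a recommendation.
import os

os.environ.setdefault("PREFECT_ORION_DATABASE_TIMEOUT", "30")  # seconds
```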
We will try to update Prefect in the near future as well, but we're also occasionally experiencing the network issue (https://github.com/PrefectHQ/prefect/issues/7512), so we're doing a little trial and error with versions at the moment.
Logs from docker
Logs from the server
Reproduction
Error
No response
Versions
Additional context
No response