PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

Worker mistakenly marks flow runs as Crashed #13020

Open EmilRex opened 6 months ago

EmilRex commented 6 months ago

Customers have reported that the ECS worker will occasionally mark a flow run as Crashed even though the flow run is actually Running, or possibly even Completed. This seems to happen randomly at high volumes of flow run submissions. It was recently observed in ~10 of ~150 flow runs submitted at the same time, with potentially other flow runs being submitted concurrently. The behavior should be reproducible at sufficient scale.

Specifically, the mechanism is that the worker's submission timeout is exceeded while it waits for the ECS task to start, so the worker reports a Crashed state. However, the flow run does start successfully and transitions onward. The problem is that the worker doesn't observe the start, and it is unclear why.

The flow run logs in the UI will contain a message like the following (from prefect.flow_runs.worker):

Failed to submit flow run '<FLOW-RUN-ID>' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 538, in run
    ) = await run_sync_in_worker_thread(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 700, in _create_task_and_wait_for_start
    self._wait_for_task_start(
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 898, in _wait_for_task_start
    for task in self._watch_task_run(
  File "/usr/local/lib/python3.11/site-packages/prefect_aws/workers/ecs_worker.py", line 1115, in _watch_task_run
    raise RuntimeError(
RuntimeError: Timed out after 495.322021484375s while watching task for status RUNNING.
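
For reference, the timeout in the traceback comes from the worker's task-watch loop, which polls ECS for the task status and gives up once the configured task start timeout elapses. A rough, simplified sketch of that behavior (not the actual prefect_aws implementation; the function and argument names here are illustrative):

import time

import boto3


def wait_for_task_running(cluster: str, task_arn: str, timeout: float = 300.0) -> None:
    """Poll ECS until the task reports RUNNING, or give up after `timeout` seconds."""
    ecs = boto3.client("ecs")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
        tasks = response.get("tasks", [])
        if tasks and tasks[0].get("lastStatus") == "RUNNING":
            return
        time.sleep(5)
    # This is the path that produces the "Timed out after ...s while watching
    # task for status RUNNING" error and the resulting Crashed state, even
    # though the task may still reach RUNNING shortly afterwards.
    raise RuntimeError(
        f"Timed out after {timeout}s while watching task for status RUNNING."
    )
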
jaceiverson commented 6 months ago

We are seeing the same issue with our flows on ECS. We trigger the flow, it times out after ~300 seconds waiting for the ECS task to start, but then it seems to sort itself out and resumes running. The UI does temporarily show a Crashed state before recovering and finishing as Completed. Exact same traceback, but our timeout is ~302 seconds in all of the errors.

We first saw this error on March 5th, 2024.
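
As a stopgap we are looking at raising the task start timeout per deployment via job variables. If I'm reading the ECS worker defaults right, the relevant variable appears to be task_start_timeout_seconds (defaulting to 300 seconds), so something like the following sketch should do it (the pool name and image are placeholders, and the image is assumed to already exist):

from prefect import flow


@flow(log_prints=True)
def my_flow():
    print("hello from ECS")


if __name__ == "__main__":
    # Raise the ECS task start timeout well above the apparent 300 s default
    # so slow task startups are not reported as Crashed.
    my_flow.deploy(
        name="ecs-example",
        work_pool_name="my-ecs-pool",
        image="my-registry/my-image:latest",
        build=False,
        push=False,
        job_variables={"task_start_timeout_seconds": 600},
    )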