Open EmilRex opened 6 months ago
We are seeing the same issue with our flows using ECS. We trigger the flow, it times out after 300 seconds trying to connect to ECS, but then seems to figure out the setup and resumes running. The UI temporarily shows a "crashed" status before recovering and finishing with a status of "complete". We get the exact same traceback, except that our time is 302 for all errors.
We first saw this error on March 5th, 2024.
Customers have reported that the ECS worker will occasionally mark a flow run as Crashed even though the flow run is actually Running, or possibly even Completed. This seems to happen intermittently when a high volume of flow runs is submitted. Specifically, it was recently observed in ~10 of ~150 flow runs that were submitted at the same time, with other flow runs potentially being submitted as well. The behavior is likely reproducible at sufficient scale.
Specifically, the mechanism is that the worker's submission timeout is exceeded, which causes a Crashed state. However, since the flow run is able to start successfully, it continues on through its later states. The problem is that the worker doesn't observe the start, and it is unclear why.
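To make the state sequence concrete, here is a minimal simulation of the race described above. It is only a sketch, not Prefect's worker code: `flow_run`, `worker`, `simulate`, and both delay parameters are hypothetical names, and a slow start stands in for whatever actually prevents the worker from observing the start, which is still unknown.

```python
import asyncio


async def flow_run(startup_delay, history, started):
    """Hypothetical flow run: starts after a delay and reports its own states."""
    await asyncio.sleep(startup_delay)
    history.append("Running")
    started.set()  # the signal the worker is waiting for
    history.append("Completed")


async def worker(submission_timeout, history, started):
    """Hypothetical worker: reports Crashed if it sees no start in time."""
    try:
        await asyncio.wait_for(started.wait(), timeout=submission_timeout)
    except asyncio.TimeoutError:
        # The worker gives up, but the flow run itself is not stopped.
        history.append("Crashed")


async def simulate(startup_delay, submission_timeout):
    """Run both actors concurrently and return the observed state history."""
    history = []
    started = asyncio.Event()
    await asyncio.gather(
        flow_run(startup_delay, history, started),
        worker(submission_timeout, history, started),
    )
    return history


if __name__ == "__main__":
    # Start observed too late: Crashed is recorded, then the run recovers.
    print(asyncio.run(simulate(startup_delay=0.2, submission_timeout=0.02)))
    # Start observed in time: no Crashed state is recorded.
    print(asyncio.run(simulate(startup_delay=0.02, submission_timeout=0.5)))
```

In the first call the history goes Crashed, then Running, then Completed, which matches what the UI shows in the reports above. The simulation says nothing about the root cause.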
The flow run logs in the UI will contain a message like the following (from `prefect.flow_runs.worker`):