ArGJolan closed this issue 8 months ago
I am also getting this error in my Process Workers. I have an auto-scaling pod of Process workers which pulls from a queue. This specific error causes the container to crash and be restarted.
I have a pretty high volume, so it is happening fairly frequently.
Versions: Server: prefecthq/prefect:2.11.4-python3.11, Worker: prefecthq/prefect:2.11.4-python3.9.2
Bumping this, as it is still causing my pod to crash and I have to run clean-up scripts for the Pending and Running FlowRuns that were in flight on the worker.
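For reference, a rough sketch of what such a clean-up script might look like, assuming the Prefect 2.x Python client (exact import paths and filter names vary between releases); the state filter and the Cancelled target state are illustrative choices, not the reporter's actual script:

```python
# Hypothetical clean-up sketch: cancel flow runs left in Pending/Running
# after a worker crash. Assumes the Prefect 2.x client API.
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled


async def cancel_stuck_flow_runs() -> None:
    async with get_client() as client:
        # Find flow runs stuck in Pending or Running
        stuck = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(
                        any_=[StateType.PENDING, StateType.RUNNING]
                    )
                )
            )
        )
        for flow_run in stuck:
            # force=True bypasses the usual orchestration transition checks
            await client.set_flow_run_state(
                flow_run_id=flow_run.id,
                state=Cancelled(message="orphaned after worker crash"),
                force=True,
            )


if __name__ == "__main__":
    asyncio.run(cancel_stuck_flow_runs())
```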
I believe this is closed as completed thanks to @desertaxle. Please feel free to reopen if you encounter this issue in more recent versions.
Bug summary
Our setup is the following: We deploy Prefect in EKS via Helm, using prefect-server from https://prefecthq.github.io/prefect-helm for the server and prefect-agent from the same repo for the agent (both chart versions 2023.7.20). Our server has a minimum of 2 replicas and the HPA allows it to scale as high as 10.
We patched the agent with Kustomize and added an HPA as it's not supported by the chart by default. It has a minimum of 2 replicas and can scale as high as 10.
Our flows are running on Fargate using ECS task definitions.
The issue we face is that sometimes, agents throw the
RuntimeError: this borrower is already holding one of this CapacityLimiter's tokens
error and then hang until all the flows they are tracking finish. While in that state they do not pick up newly scheduled flow runs, and the flow queue is stuck until the tracked flows are finished. Moreover, we use a retry policy on some of our flows, and it seems that if a different agent picks up the retried flow run, the agent that handled the first run still "tracks" it and can stay stuck for even longer. We had cases where two agents crashed an hour apart from each other but only recovered a few hours later, once all flows were done, because they were both tracking a very long flow.
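To illustrate what the error itself means (this is a minimal sketch of the underlying anyio behavior, not the Prefect agent code path): a CapacityLimiter raises this RuntimeError when the same task tries to acquire a token it already holds.

```python
# Minimal illustration of the anyio error, independent of Prefect.
import anyio


async def main() -> None:
    limiter = anyio.CapacityLimiter(1)
    async with limiter:
        # Second acquisition by the same borrower (the current task) raises:
        # RuntimeError: this borrower is already holding one of this
        # CapacityLimiter's tokens
        async with limiter:
            pass


anyio.run(main)
```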
Reproduction
Error
Versions
Additional context
Our issue isn't necessarily the crash itself: if the container errored out properly, it would be recreated by EKS and would auto-recover. Our issue is mostly the hanging, because the agent doesn't exit when the error happens but only later, once its flows are done.