PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0
17.19k stars 1.63k forks source link

This borrower is already holding one of this CapacityLimiter's tokens #10632

Closed ArGJolan closed 8 months ago

ArGJolan commented 1 year ago

First check

Bug summary

Our setup is the following: We deploy prefect in EKS via helm using prefect-server from https://prefecthq.github.io/prefect-helm for the server and prefect-agent from the same repo for the agent. (Both chart versions 2023.7.20)

Our server has a minimum of 2 replicas and the HPA allows it to scale as high as 10.

We patched the agent with Kustomize and added an HPA as it's not supported by the chart by default. It has a minimum of 2 replicas and can scale as high as 10.

Our flows are running on Fargate using ECS task definitions.

The issue we face is that sometimes, agents throw the RuntimeError: this borrower is already holding one of this CapacityLimiter's tokens error and then hang until all the flows they are tracking finish. They are not handling new flow schedules and the flow queue gets stuck until tracked flows are finished.

Moreover, we use a retry policy on some of our flows, and it seems like if a different agent picks the retry flow, the agent that handled the first run still "tracks it" and can get stuck for even longer. We had cases were 2 agents crashed 1 hour apart from each other but only recovered a few hours later once all flows were done as they were both tracking a very long flow.

Reproduction

We did not find anything that indicated why the agents were erroring.
A previous issue open in this repo has the same error but it seems resolved since December 2022, our version is more recent than that.

Error

-09-04 10:46:31 08:46:31.510 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
2023-09-04 10:46:35 08:46:35.362 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
2023-09-04 10:46:35 08:46:35.537 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
2023-09-04 10:46:40 08:46:40.918 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
2023-09-04 10:46:45 08:46:45.751 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
2023-09-04 10:46:46 08:46:46.873 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
2023-09-04 10:46:50 08:46:50.080 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
2023-09-04 10:46:50 08:46:50.545 | DEBUG   | MainThread   | prefect._internal.concurrency - <AsyncCancelScope, name='start' COMPLETED, runtime=1260.76> exited
2023-09-04 10:46:50 08:46:50.546 | DEBUG   | MainThread   | prefect._internal.concurrency - Encountered exception in async call start(work_queue=None, work_queues=['default', 'tracking-reports'], work_queue_prefix=[], work_pool_name='...)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 346, in _run_async
    result = await coro
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/agent.py", line 209, in start
    tg.start_soon(
  File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/services.py", line 53, in critical_service_loop
    await workload()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent.py", line 233, in get_and_submit_flow_runs
    self.limiter.acquire_on_behalf_of_nowait(flow_run.id)
  File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 1846, in acquire_on_behalf_of_nowait
    raise RuntimeError(
RuntimeError: this borrower is already holding one of this CapacityLimiter's tokens
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
Loaded collection 'prefect_aws'.
An exception occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/_utilities.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 255, in coroutine_wrapper
    return call()
  File "/usr/local/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 383, in __call__
    return self.result()
  File "/usr/local/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 283, in result
    return self.future.result(timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 169, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 346, in _run_async
    result = await coro
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/agent.py", line 209, in start
    tg.start_soon(
  File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.8/site-packages/prefect/utilities/services.py", line 53, in critical_service_loop
    await workload()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent.py", line 233, in get_and_submit_flow_runs
    self.limiter.acquire_on_behalf_of_nowait(flow_run.id)
  File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 1846, in acquire_on_behalf_of_nowait
    raise RuntimeError(
RuntimeError: this borrower is already holding one of this CapacityLimiter's tokens
2023-09-04 10:48:00 08:48:00.426 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_646f5dd430dab5248b2d8443': Status is DEPROVISIONING.
2023-09-04 10:48:20 08:48:20.582 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_646f5dd430dab5248b2d8443': Status is STOPPED.
2023-09-04 10:48:20 08:48:20.583 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_646f5dd430dab5248b2d8443': Container 'prefect' exited successfully.
2023-09-04 10:49:10 08:49:10.230 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6490702d67c8310aaf85b851': Status is DEPROVISIONING.
2023-09-04 10:49:13 08:49:13.552 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_60d43a2f21699f0039190d97': Status is DEPROVISIONING.
2023-09-04 10:49:20 08:49:20.310 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6490702d67c8310aaf85b851': Status is STOPPED.
2023-09-04 10:49:20 08:49:20.311 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6490702d67c8310aaf85b851': Container 'prefect' exited successfully.
2023-09-04 10:49:23 08:49:23.630 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_60d43a2f21699f0039190d97': Status is STOPPED.
2023-09-04 10:49:23 08:49:23.631 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_60d43a2f21699f0039190d97': Container 'prefect' exited successfully.
2023-09-04 10:49:27 08:49:27.626 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6436d059f716f06e5f9ef38c': Status is DEPROVISIONING.
2023-09-04 10:49:37 08:49:37.724 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6436d059f716f06e5f9ef38c': Status is STOPPED.
2023-09-04 10:49:37 08:49:37.725 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6436d059f716f06e5f9ef38c': Container 'prefect' exited successfully.
2023-09-04 10:51:00 08:51:00.498 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64d3d572c43cc23d53dfdd67': Status is DEPROVISIONING.
2023-09-04 10:51:10 08:51:10.578 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64d3d572c43cc23d53dfdd67': Status is STOPPED.
2023-09-04 10:51:10 08:51:10.578 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64d3d572c43cc23d53dfdd67': Container 'prefect' exited successfully.
2023-09-04 10:57:34 08:57:34.294 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62b4768bc1b9f73649e2030b': Status is DEPROVISIONING.
2023-09-04 10:57:44 08:57:44.370 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62b4768bc1b9f73649e2030b': Status is STOPPED.
2023-09-04 10:57:44 08:57:44.371 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62b4768bc1b9f73649e2030b': Container 'prefect' exited successfully.
2023-09-04 11:00:10 09:00:10.993 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_63c4178c204e8f62b6294571': Status is DEPROVISIONING.
2023-09-04 11:00:31 09:00:31.146 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_63c4178c204e8f62b6294571': Status is STOPPED.
2023-09-04 11:00:31 09:00:31.147 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_63c4178c204e8f62b6294571': Container 'prefect' exited successfully.
2023-09-04 11:04:35 09:04:35.033 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64520ed110685730374941fa': Status is DEPROVISIONING.
2023-09-04 11:04:45 09:04:45.124 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64520ed110685730374941fa': Status is STOPPED.
2023-09-04 11:04:45 09:04:45.125 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64520ed110685730374941fa': Container 'prefect' exited successfully.
2023-09-04 11:09:52 09:09:52.445 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_61fa3fc54c85c300391b7b25': Status is DEPROVISIONING.
2023-09-04 11:10:12 09:10:12.601 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_61fa3fc54c85c300391b7b25': Status is STOPPED.
2023-09-04 11:10:12 09:10:12.602 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_61fa3fc54c85c300391b7b25': Container 'prefect' exited successfully.
2023-09-04 11:17:28 09:17:28.387 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6312075f967d940f5db90723': Status is DEPROVISIONING.
2023-09-04 11:17:38 09:17:38.461 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6312075f967d940f5db90723': Status is STOPPED.
2023-09-04 11:17:38 09:17:38.462 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6312075f967d940f5db90723': Container 'prefect' exited successfully.
2023-09-04 11:21:48 09:21:48.008 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64959e3ee1013135ebb29ee7': Status is DEPROVISIONING.
2023-09-04 11:21:58 09:21:58.087 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64959e3ee1013135ebb29ee7': Status is STOPPED.
2023-09-04 11:21:58 09:21:58.088 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64959e3ee1013135ebb29ee7': Container 'prefect' exited successfully.
2023-09-04 11:26:35 09:26:35.936 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_635638f0e160742643e653e9': Status is DEPROVISIONING.
2023-09-04 11:26:46 09:26:46.010 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_635638f0e160742643e653e9': Status is STOPPED.
2023-09-04 11:26:46 09:26:46.010 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_635638f0e160742643e653e9': Container 'prefect' exited successfully.
2023-09-04 11:32:15 09:32:15.612 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_644271c37d569e03b23979b3': Status is DEPROVISIONING.
2023-09-04 11:32:25 09:32:25.388 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64427494af7c0806755f538e': Status is DEPROVISIONING.
2023-09-04 11:32:35 09:32:35.767 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_644271c37d569e03b23979b3': Status is STOPPED.
2023-09-04 11:32:35 09:32:35.768 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_644271c37d569e03b23979b3': Container 'prefect' exited successfully.
2023-09-04 11:32:47 09:32:47.537 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64427494af7c0806755f538e': Status is STOPPED.
2023-09-04 11:32:47 09:32:47.538 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64427494af7c0806755f538e': Container 'prefect' exited successfully.
2023-09-04 11:41:52 09:41:52.980 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62739824f396593b79d0cd5d': Status is DEPROVISIONING.
2023-09-04 11:42:03 09:42:03.050 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62739824f396593b79d0cd5d': Status is STOPPED.
2023-09-04 11:42:03 09:42:03.051 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62739824f396593b79d0cd5d': Container 'prefect' exited successfully.
2023-09-04 11:43:54 09:43:54.135 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_634bc9c5504bcb0fb91aa7e1': Status is DEPROVISIONING.
2023-09-04 11:44:04 09:44:04.205 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_634bc9c5504bcb0fb91aa7e1': Status is STOPPED.
2023-09-04 11:44:04 09:44:04.206 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_634bc9c5504bcb0fb91aa7e1': Container 'prefect' exited successfully.
2023-09-04 11:52:42 09:52:42.967 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62e939684df7a518ce222d14': Status is DEPROVISIONING.
2023-09-04 11:53:03 09:53:03.131 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62e939684df7a518ce222d14': Status is STOPPED.
2023-09-04 11:53:03 09:53:03.132 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_62e939684df7a518ce222d14': Container 'prefect' exited successfully.
2023-09-04 12:00:30 10:00:30.569 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64230678a737cf2e42bf32c2': Status is DEPROVISIONING.
2023-09-04 12:00:40 10:00:40.647 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64230678a737cf2e42bf32c2': Status is STOPPED.
2023-09-04 12:00:40 10:00:40.648 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_64230678a737cf2e42bf32c2': Container 'prefect' exited successfully.
2023-09-04 12:28:20 10:28:20.807 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_648c60e353ac7959f79cda4d': Status is DEPROVISIONING.
2023-09-04 12:28:30 10:28:30.885 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_648c60e353ac7959f79cda4d': Status is STOPPED.
2023-09-04 12:28:30 10:28:30.885 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_648c60e353ac7959f79cda4d': Container 'prefect' exited successfully.
2023-09-04 14:19:15 12:19:15.652 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6402174fe4175c55131a3e1e': Status is DEPROVISIONING.
2023-09-04 14:19:27 12:19:27.728 | INFO    | prefect.infrastructure.ecs-task - ECSTask 'Flow_6402174fe4175c55131a3e1e': Status is STOPPED.
2023-09-04 14:19:27 12:19:27.729 | WARNING | prefect.infrastructure.ecs-task - ECSTask 'Flow_6402174fe4175c55131a3e1e': Container 'prefect' exited with non-zero exit code 137.
2023-09-04 14:19:27 12:19:27.731 | DEBUG   | EventsWorkerThread | prefect._internal.concurrency - <WatcherThreadCancelScope, name='get' COMPLETED, runtime=14016.44> exited
2023-09-04 14:19:27 12:19:27.732 | DEBUG   | EventsWorkerThread | prefect._internal.concurrency - Finished call get()
2023-09-04 14:19:27 12:19:27.732 | DEBUG   | GlobalEventLoopThread | prefect._internal.concurrency - Exiting service <prefect.events.worker.EventsWorker object at 0x7f87cf573d00>

Versions

Server & Agent: prefecthq/prefect:2.11.0-python3.8

Additional context

Our issue isn't necessarily the crash itself, if the container errored out properly, it would be recreated by EKS and would auto recover. Our issue is mostly the hanging part of it because it doesn't exit when the issue is happening but later when flows are done.

gregoryhunt commented 1 year ago

I am also getting this error is my Process Workers. I have an auto scaling Pod of Process workers which pulls from a queue. This specific error causes the container to crash and be restarted.

I have a pretty high volume so I is happening faily frequently

Versions: Server: prefecthq/prefect:2.11.4-python3.11 Worker: prefecthq/prefect:2.11.4-python3.9.2

gregoryhunt commented 1 year ago

Bumping this as it is still causing my pod to crash and I have to run clean up scripts for the Pending and Running FlowRuns that were in flight on the worker

serinamarie commented 8 months ago

I believe this is closed as completed thanks to @desertaxle. Please feel free to reopen if you come upon this issue in more recent versions.