Open arneyjfs opened 5 months ago
For some additional context, I removed the prefect @task
decorator from the task to let Dask handle scaling the tasks natively and I was easily able to scale to 1,000,000+ tasks. I think I can therefore also rule out the dask server as being a bottleneck here
Has anyone been able to reproduce this? @desertaxle? Perhaps there is some more information I can provide
I get the same error when submitting a large number of tasks using the ConcurrentTaskRunner
We occasionally see this on regular tasks/flows as well:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 872, in orchestrate_flow_run
result = await flow_call.aresult()
File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 327, in aresult
return await asyncio.wrap_future(self.future)
File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 389, in _run_async
result = await coro
File "/opt/prefect/flows/extraction/tiktok/tiktok_e_accounts.py", line 122, in extract_adgroups
await asyncio.gather(*tasks)
File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/api.py", line 150, in wait_for_call_in_loop_thread
return call.result()
File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 318, in result
return self.future.result(timeout=timeout)
File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 179, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 389, in _run_async
result = await coro
File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 1600, in get_task_call_return_value
return await future._result()
File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
File "/usr/local/lib/python3.10/site-packages/prefect/states.py", line 91, in _get_state_result
raise await get_state_exception(state)
File "/usr/local/lib/python3.10/site-packages/prefect/task_runners.py", line 231, in submit
result = await call()
File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 1851, in begin_task_run
state = await orchestrate_task_run(
File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 2033, in orchestrate_task_run
state = await propose_state(
File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 2576, in propose_state
response = await set_state_and_handle_waits(set_state)
File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 2563, in set_state_and_handle_waits
response = await set_state_func()
File "/usr/local/lib/python3.10/site-packages/prefect/client/orchestration.py", line 2317, in set_task_run_state
response = await self._client.post(
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1892, in post
return await self.request(
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1574, in request
return await self.send(request, auth=auth, follow_redirects=follow_redirects)
File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 311, in send
response = await self._send_with_retry(
File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 235, in _send_with_retry
response = await send(request, *send_args, **send_kwargs)
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1661, in send
response = await self._send_handling_auth(
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1689, in _send_handling_auth
response = await self._send_handling_redirects(
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1726, in _send_handling_redirects
response = await self._send_single_request(request)
File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1763, in _send_single_request
response = await transport.handle_async_request(request)
File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 372, in handle_async_request
with map_httpcore_exceptions():
File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.PoolTimeout
We're using prefect 2.16.9 in our jobs
Based on https://github.com/PrefectHQ/prefect/blob/2.16.9/src/prefect/client/base.py#L325 it looks like there is a retry for the httpx.PoolTimeout. By default the send will retry 5 times, so it's odd that it's still having issues by then unless the API is seriously overloaded or there are too many connections being initiated simultaneously on the process/container/host.
We use a kubernetes backend with a fair amount of asyncio coroutines (and each coroutine may make calls to the prefect for blocks or rate limiting). It's common for us to have 100-200 simultaneously running flows. The flow runs are distributed across ~25 small (1-2 CPU) hosts. Since our workflows are io-bound, we never see CPU maxed.
Is there any known solution to get around this issue? We are facing similar problems with spinning up many mapped tasks / DaskTaskRunner
and are unable to make progress around this point
I was considering changing the PREFECT_API_REQUEST_TIMEOUT
and PREFECT_CLIENT_MAX_RETRIES
, but those default to 60.0
and 5
, which means it should theoretically try for > 300 seconds (with 2**{retry_num}
backoff on _send_with_retry
). I'd be surprised if it needs that much time if it's just a transient issue.
That said, maybe it's possible that the retries aren't actually going through, or the pool gets into a bad state. Would probably need to work in some debug logs to figure out what might be going on (at least within the _send_with_retries code). If I can get some free time, I'll try to patch things in our dev code image and run the reproduction case provided above.
I've seen improvement in httpx.PoolTimeout
errors by disabling HTTP2 in the Prefect client. You can disable HTTP2 by setting PREFECT_API_ENABLE_HTTP2=False
in your profile.
There's also been a big reduction in the number of API calls made during task orchestration in 3.0, so it might be worthwhile to try out 3.0 and see if there's improvement for your use case!
We're already using PREFECT_API_ENABLE_HTTP2=False and still see this a fair amount (though I don't remember if it's an improvement from having it set to True)
Similarly, we've tried updating PREFECT_API_ENABLE_HTTP2=False and are still seeing failures with runs across ~1500 mapped tasks. When looking directly at DEBUG logs we are able to surface the following logs showing Prefect worker threads shutting down before all goes quiet. We are unfortunately unable to upgrade to Prefect3 as our organization is using Prefect2.
ERROR 2024-08-22T18:00:54.010352253Z [resource.labels.taskName: workerpool0-0] 18:00:53.254 | DEBUG | APILogWorkerThread | prefect._internal.concurrency - Running call get(timeout=1.8705078409999487) in thread 'APILogWorkerThread'
ERROR 2024-08-22T18:00:54.010356674Z [resource.labels.taskName: workerpool0-0] 18:00:53.254 | DEBUG | APILogWorkerThread | prefect._internal.concurrency - Running call get(timeout=1.8681604759999573) in thread 'APILogWorkerThread'
ERROR 2024-08-22T18:00:54.010360610Z [resource.labels.taskName: workerpool0-0] 18:00:53.296 | DEBUG | APILogWorkerThread | prefect._internal.concurrency - Running call get(timeout=1.9999872829999958) in thread 'APILogWorkerThread'
ERROR 2024-08-22T18:00:54.010364465Z [resource.labels.taskName: workerpool0-0] 18:00:53.394 | INFO | WorkerThread-23 | prefect._internal.concurrency - Exiting worker thread 'WorkerThread-23'
ERROR 2024-08-22T18:00:54.010371081Z [resource.labels.taskName: workerpool0-0] 18:00:53.413 | INFO | WorkerThread-30 | prefect._internal.concurrency - Exiting worker thread 'WorkerThread-30'
ERROR 2024-08-22T18:00:54.010386549Z [resource.labels.taskName: workerpool0-0] 18:00:53.545 | DEBUG | APILogWorkerThread | prefect._internal.concurrency - Running call get(timeout=1.5788529749999611) in thread 'APILogWorkerThread'
ERROR 2024-08-22T18:00:54.010390523Z [resource.labels.taskName: workerpool0-0] 18:00:53.546 | DEBUG | EventsWorkerThread | prefect._internal.concurrency - Running call get() in thread 'EventsWorkerThread'
ERROR 2024-08-22T18:00:54.010394558Z [resource.labels.taskName: workerpool0-0] 18:00:53.553 | DEBUG | APILogWorkerThread | prefect._internal.concurrency - Running call get(timeout=1.743925106000006) in thread 'APILogWorkerThread'
ERROR 2024-08-22T18:00:54.010399764Z [resource.labels.taskName: workerpool0-0] 18:00:53.554 | DEBUG | EventsWorkerThread | prefect._internal.concurrency - Running call get() in thread 'EventsWorkerThread'
ERROR 2024-08-22T18:00:54.010404003Z [resource.labels.taskName: workerpool0-0] 18:00:53.634 | INFO | APILogWorkerThread | prefect._internal.concurrency - Exiting worker thread 'APILogWorkerThread'
Finally, when it fails. PoolTimeout crash details are as follows:
Crash details:
Exception Group Traceback (most recent call last): | File "/opt/block/venv/lib/python3.9/site-packages/prefect/engine.py", line 2254, in report_flow_run_crashes | yield | File "/usr/local/lib/python3.9/contextlib.py", line 653, in aexit | cb_suppress = await cb(*exc_details) | File "/opt/block/venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 680, in aexit | raise BaseExceptionGroup( | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (14 sub-exceptions) +-+---------------- 1 ---------------- | Traceback (most recent call last): | File "/opt/block/venv/lib/python3.9/site-packages/httpx/_transports/default.py", line 69, in map_httpcore_exceptions | yield | File "/opt/block/venv/lib/python3.9/site-packages/httpx/_transports/default.py", line 373, in handle_async_request | resp = await self._pool.handle_async_request(req) | File "/opt/block/venv/lib/python3.9/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request | raise exc from None | File "/opt/block/venv/lib/python3.9/site-packages/httpcore/_async/connection_pool.py", line 192, in handle_async_request | connection = await pool_request.wait_for_connection(timeout=timeout) | File "/opt/block/venv/lib/python3.9/site-packages/httpcore/_async/connection_pool.py", line 35, in wait_for_connection | await self._connection_acquired.wait(timeout=timeout) | File "/opt/block/venv/lib/python3.9/site-packages/httpcore/_synchronization.py", line 150, in wait | await self._anyio_event.wait() | File "/usr/local/lib/python3.9/contextlib.py", line 137, in exit | self.gen.throw(typ, value, traceback) | File "/opt/block/venv/lib/python3.9/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions | raise to_exc(exc) from exc | httpcore.PoolTimeout |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The above exception was the direct cause of the following exception: | |||||||||||||||||||||||||
Traceback (most recent call last): | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/prefect/engine.py", line 1672, in create_task_run_then_submit | |||||||||||||||||||||||||
await create_task_run( | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/prefect/engine.py", line 1720, in create_task_run | |||||||||||||||||||||||||
task_run = await flow_run_context.client.create_task_run( | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/prefect/client/orchestration.py", line 2226, in create_task_run | |||||||||||||||||||||||||
response = await self._client.post( | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/httpx/_client.py", line 1892, in post | |||||||||||||||||||||||||
return await self.request( | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/httpx/_client.py", line 1574, in request | |||||||||||||||||||||||||
return await self.send(request, auth=auth, follow_redirects=follow_redirects) | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/prefect/client/base.py", line 327, in send | |||||||||||||||||||||||||
response = await self._send_with_retry( | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/prefect/client/base.py", line 251, in _send_with_retry | |||||||||||||||||||||||||
response = await send(request, *send_args, **send_kwargs) | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/httpx/_client.py", line 1661, in send | |||||||||||||||||||||||||
response = await self._send_handling_auth( | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/httpx/_client.py", line 1689, in _send_handling_auth | |||||||||||||||||||||||||
response = await self._send_handling_redirects( | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/httpx/_client.py", line 1726, in _send_handling_redirects | |||||||||||||||||||||||||
response = await self._send_single_request(request) | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/httpx/_client.py", line 1763, in _send_single_request | |||||||||||||||||||||||||
response = await transport.handle_async_request(request) | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/httpx/_transports/default.py", line 373, in handle_async_request | |||||||||||||||||||||||||
resp = await self._pool.handle_async_request(req) | |||||||||||||||||||||||||
File "/usr/local/lib/python3.9/contextlib.py", line 137, in exit | |||||||||||||||||||||||||
self.gen.throw(typ, value, traceback) | |||||||||||||||||||||||||
File "/opt/block/venv/lib/python3.9/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions | |||||||||||||||||||||||||
raise mapped_exc(message) from exc | |||||||||||||||||||||||||
httpx.PoolTimeout |
Looking into the nodes seeing PoolTimeouts, I noticed we had a flow run that can run up to 100-150 tasks concurrently through asyncio. Each task uses prefect async ratelimit to throttle requests for data extraction from an API - in addition to any connections used to manage prefect task/flow logs or state, so there may be a fair number of active connections.
After I reduced the parallelism to 50-75, I haven't seen any more pool timeouts whereas we saw them regularly before.
If we can confirm that the retries are happening as expected (5 times over at least 300 seconds), then I'm inclined to believe it's more of an infrastructure connection limitation issue per process/pod/host/NAT/prefect workspace/prefect API. I'm not familiar with how the connections work at the task level, but if any tasks have their own connections, maybe having a common manager through which all tasks interface would help? Otherwise there may be a physical limit to how many tasks can start/run/finish simultaneously.
@peakv-dersek thanks for sharing! It makes sense that the pool timeout errors decreased after decreasing the number of parallel tasks based on what I'm seeing in the httpx docs about pool timeouts. It seems like increasing the max connections on the httpx
client could help with these issues at higher concurrency/parallelism (as long the machine has enough resources).
Looks like the client sets the max_connections
here:https://github.com/PrefectHQ/prefect/blob/cb2874fea36e5ec8789a1cdfdf9955238e75093e/src/prefect/client/orchestration.py#L325
16 is quite a bit lower than the httpx
default. I think it's reasonable to make that limit configurable via a setting (something like PREFECT_API_MAX_CONNECTIONS
).
It's the comment above it that concerns me: https://github.com/PrefectHQ/prefect/blob/cb2874fea36e5ec8789a1cdfdf9955238e75093e/src/prefect/client/orchestration.py#L323-L324
Seems like it might be an issue either way. Alternatively, maybe we could try lowering it further if the task parallelism is increasing.
Next step I think is to do some exploration of the PrefectClient within each task - are they independent instances? Are the connections independent?
https://github.com/encode/httpx/discussions/1633#discussioncomment-717658
Yes. HTTPX is intended to be thread-safe, and yes, a single client-instance across all threads will do better in terms of connection pooling, than using an instance-per-thread.
Makes me think that we should be setting up a client (or at least a single connection pool) per process and sharing that among the tasks
TL;DR - issue doesn't seem to be client-side. Potential workaround is increasing PREFECT_API_REQUEST_TIMEOUT to 300 or more for the infra spec of work pools that will be expected to run flows with parallelization on tasks or other entities that require the prefect API.
Tried out a few things within a running flow using an asyncio repro case (similar to the original repro, but using async def on the task & flow and asyncio.gather to schedule the tasks and wait for them to complete), I see the pool timeouts as well.
Interestingly, it looks like all task creation is happening on the same Client and it isn't able to create all the tasks in time.
# python -m repro
Thread: 140084441921216 Created client <prefect.client.base.PrefectHttpxClient object at 0x7f67f367a650>
22:24:26.122 | INFO | prefect.engine - Created flow run 'determined-elk' for flow 'example_flow'
22:24:26.124 | INFO | Flow run 'determined-elk' - View at https://app.prefect.cloud/account/<redacted>/workspace/<redacted>/flow-runs/flow-run/<redacted>
22:24:26.364 | INFO | Flow run 'determined-elk' - Triggering 2000 tasks
22:24:27.264 | INFO | Flow run 'determined-elk' - Thread: 140084441921216 Created client <prefect.client.base.PrefectHttpxClient object at 0x7f67f01ebf40>
22:27:00.463 | INFO | Flow run 'determined-elk' - Retryable exception in thread 140084441921216: <class 'httpx.PoolTimeout'>, PoolTimeout('')
22:27:00.572 | INFO | Flow run 'determined-elk' - Retryable exception in thread 140084441921216: <class 'httpx.PoolTimeout'>, PoolTimeout('')
22:27:00.764 | INFO | Flow run 'determined-elk' - Retryable exception in thread 140084441921216: <class 'httpx.PoolTimeout'>, PoolTimeout('')
22:27:00.868 | INFO | Flow run 'determined-elk' - Retryable exception in thread 140084441921216: <class 'httpx.PoolTimeout'>, PoolTimeout('')
22:27:00.972 | INFO | Flow run 'determined-elk' - Retryable exception in thread 140084441921216: <class 'httpx.PoolTimeout'>, PoolTimeout('')
22:27:01.164 | INFO | Flow run 'determined-elk' - Retryable exception in thread 140084441921216: <class 'httpx.PoolTimeout'>, PoolTimeout('')
22:27:01.267 | INFO | Flow run 'determined-elk' - Retryable exception in thread 140084441921216: <class 'httpx.PoolTimeout'>, PoolTimeout('')
...
I tried increasing the timeout to 600 seconds (PREFECT_API_REQUEST_TIMEOUT) and saw it working at the n=2000 case:
22:29:01.107 | INFO | prefect.engine - Created flow run 'opalescent-shellfish' for flow 'example_flow'
22:29:01.109 | INFO | Flow run 'opalescent-shellfish' - View at https://app.prefect.cloud/account/<redacted>/workspace/<redacted>/flow-runs/flow-run/<redacted>
22:29:01.360 | INFO | Flow run 'opalescent-shellfish' - Triggering 2000 tasks
22:29:02.374 | INFO | Flow run 'opalescent-shellfish' - Thread: 140693059696320 Created client <prefect.client.base.PrefectHttpxClient object at 0x7ff59c043c40>
22:31:25.574 | INFO | Flow run 'opalescent-shellfish' - Created task run 'example_task-10' for task 'example_task'
22:31:25.575 | INFO | Flow run 'opalescent-shellfish' - Executing 'example_task-10' immediately...
22:31:25.775 | INFO | Flow run 'opalescent-shellfish' - Created task run 'example_task-9' for task 'example_task'
22:31:25.776 | INFO | Flow run 'opalescent-shellfish' - Executing 'example_task-9' immediately...
22:31:26.471 | INFO | Flow run 'opalescent-shellfish' - Created task run 'example_task-11' for task 'example_task'
22:31:26.472 | INFO | Flow run 'opalescent-shellfish' - Executing 'example_task-11' immediately...
22:31:26.764 | INFO | Flow run 'opalescent-shellfish' - Created task run 'example_task-12' for task 'example_task'
22:31:26.764 | INFO | Flow run 'opalescent-shellfish' - Executing 'example_task-12' immediately...
...
It takes the client 2.5 minutes to ~start~ finish getting all the tasks actually created in the api.
Increasing the max connections to 1024 while keeping the default timeout: It sends all the requests, but doesn't seem to make any progress even after 10 minutes (not even a pool timeout error). Logging the send requests, after submitting the initial 1024 very quickly, it's only able to do 1 task per second for the rest. (The previous approach maintains a much faster throughput)
In terms of flow, all the task creation POST are handled by a single prefect client (from the flow), so if too many are created at once, it seems like the pool gets backed up. There are also POST to task run /set_state
and GET to the task id all done within the flow (same client) around the time of task creation. From there the task takes over and starts making its own calls the to API. I didn't test out the parallelism of the tasks themselves in my setup, but it seemed like the initial creation was the problem anyway.
Side note: I'm not exactly sure why the task creation/execution doesn't seem very asynchronous (all /task_runs
POST requests individual tasks start running and trying to set their state), but that's something for another day...
Will probably need to look at the Dask version of this since it seems like that does run tasks as they get created rather than waiting for all of them, so I'd want to make sure it's still an issue on the task creation pool.
First check
Bug summary
I have a flow run which triggers about 2,000 tasks. I am using the DaskTaskRunner to handle running this many tasks at scale. After about 200 tasks are scheduled, scheduling hangs. After this point the flow continues for about 5 minutes, with the scheduled tasks running fine but no more being scheduled. All tasks that managed to be scheduled eventually complete and the flow hangs in a running status, not finished, but with not picking showing any new tasks starting. After a seemingly variable amount of time (0-5mins?), the flow crashes with a PoolTimeout trying to reach
<Private-Prefect-Server-IP>/api/task_runs/
Detailed sequential walkthrough of error
All logs are fine, apart from eventually reporting the crash in various ways. The docker logs on the prefect-worker are the only logs with a bit more info as shown in the stack trace provided below.
There is clearly a bottleneck here but I am struggling to find where it is. The cpu and memory stays at a normal level for all VMs. Thinking the bottleneck may be the prefect-server database (after reading similar issues) I moved to a Cloud SQL managed postgreSQL database. That has not fixed it - the database shows low loading and the issue still happens.
Reproduction
Error
Versions
Additional context
This may be a terrible architecture, so if the solution is to use prefect in a diifferent way then I would also welcome so advice here. A post to the prefect discourse on how to do something like this did not lead to any advice on good architecture patterns. For context I tried another approach of using prefect workers to distribute tasks out to other VMs (i.e. instead of 1-prefect-server => 1-prefect-worker => 1-dask-cluster => many dask-workers, i tried 1-prefect-server => many prefect-workers), but this crashed in a similar way (failed to reach api). I could also simply use prefect to trigger the flow and call dask normally from within the flow. Since it seems to be prefect struggling to keep track of all the tasks this would likely help - however I would lose the greatly increased visibility and logging that prefect provides at the task level