Closed ashtuchkin closed 2 months ago
@ashtuchkin I'll comment here once this is officially released in 2.20.7 later this week; if you're interested in testing this prior to release, you can install off our 2.x branch via `pip install -U git+https://github.com/PrefectHQ/prefect.git@2.x` once #15289 is merged!
@cicdw Thanks for fixing this issue. When can we expect the official release of 2.20.7?
I just cut the release moments ago @jashwanth9 ! It should go live on PyPI imminently.
Thank you @cicdw !
Bug summary
We have a medium-sized Prefect deployment on an AWS EKS cluster with an RDS Postgres database. Recently we started using a lot of subflows, accumulating about 50k of them (most in a completed state). For the last couple of days we were fire-fighting the deployment falling over: all 3 pods of the Prefect server were overloaded (100% CPU), everything was super slow, late flows were accumulating, etc.
After investigation, we realized that the issue was the CancellationCleanup loop taking about 5 minutes to run and using ~60-70% of CPU, while also adding unreasonable load to the database. After finishing, the loop immediately starts over from the beginning, starving the whole server of resources and causing failures in a lot of other places. We confirmed it was the culprit by disabling all the loops one by one and checking CPU usage, database load, and the overall responsiveness of the web interface.
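For anyone reproducing this bisection, server loops can be toggled via environment-variable settings. A sketch assuming Prefect 2.x setting names (verify against your version with `prefect config view --show-defaults`):

```shell
# Assumed Prefect 2.x setting names -- verify against your installed version.
# Disable the suspect loop on the server pods to measure its impact:
export PREFECT_API_SERVICES_CANCELLATION_CLEANUP_ENABLED=false
# Other server loops can be toggled the same way while bisecting:
export PREFECT_API_SERVICES_LATE_RUNS_ENABLED=false
export PREFECT_API_SERVICES_SCHEDULER_ENABLED=false
```

Restart the server pods after changing the settings so the loops pick them up.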
Specifically, what appears to happen is that in the `clean_up_cancelled_subflow_runs` function, we go through ALL subflows in the database (in all states, including completed ones), and then for each of them run `_cancel_subflow`. That initial query seems to be pretty heavy, as it also preloads the corresponding `flow_run_state`, etc. My guess is that this query is not doing what we expect it to do - maybe `db.FlowRun.id > high_water_mark` needs to be moved into the AND expression? https://github.com/PrefectHQ/prefect/blob/2.x/src/prefect/server/services/cancellation_cleanup.py#L79-L92
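To illustrate the suspected bug without the actual Prefect query (the names and states below are assumptions for illustration, not the real code): if the pagination cursor condition `id > high_water_mark` sits inside the OR alongside the state filters, the predicate matches every run past the cursor regardless of state, so the loop sweeps completed subflows too. AND-ing the cursor with the state filter restricts the scan to cancellable runs:

```python
# Hypothetical sketch of the suspected filter bug; state names and record
# shape are assumptions, not the actual Prefect schema.
ACTIVE_STATES = {"PENDING", "SCHEDULED", "RUNNING", "PAUSED", "CANCELLING"}

def buggy_filter(run: dict, high_water_mark: int) -> bool:
    # Cursor condition inside the OR: ANY run past the cursor matches,
    # including completed ones -- the whole table gets scanned.
    return run["state"] in ACTIVE_STATES or run["id"] > high_water_mark

def fixed_filter(run: dict, high_water_mark: int) -> bool:
    # Cursor condition AND'ed with the state filter: only runs in a
    # cancellable state past the cursor match.
    return run["state"] in ACTIVE_STATES and run["id"] > high_water_mark

runs = [
    {"id": 1, "state": "COMPLETED"},
    {"id": 2, "state": "RUNNING"},
    {"id": 3, "state": "COMPLETED"},
]

# With cursor 0, the buggy predicate sweeps all 3 runs; the fix keeps only 1.
assert sum(buggy_filter(r, 0) for r in runs) == 3
assert sum(fixed_filter(r, 0) for r in runs) == 1
```

In SQLAlchemy terms, the fix would be moving the comparison out of the `sa.or_(...)` call so it is combined conjunctively by `.where(...)`.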
Version info (`prefect version` output)

Additional context
No response