PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

CancellationCleanup service pegs CPU to 100% #15231

Closed by ashtuchkin 2 months ago

ashtuchkin commented 2 months ago

Bug summary

We have a medium-sized Prefect deployment on an AWS EKS cluster backed by an RDS Postgres database. Recently we started using a lot of subflows and have accumulated about 50k of them (most in a completed state). For the last couple of days we have been fire-fighting the deployment falling over: all 3 Prefect server pods overloaded at 100% CPU, everything extremely slow, late flow runs piling up, etc.

After investigation, we realized that the issue was the CancellationCleanup loop: each pass takes about 5 minutes, uses ~60-70% of the CPU, and puts unreasonable load on the database. As soon as a pass finishes, the loop immediately starts over, leaving the whole server starved for resources and failing in many other places. We confirmed it was the culprit by disabling the server loops one by one and checking CPU usage, database load, and overall responsiveness of the web interface (rough sketch of that below).
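For reference, the bisection was simply: turn off one server loop at a time and watch CPU, database load, and UI responsiveness. A rough sketch under the assumption that the Prefect 2.x enable/disable settings are named as below (worth verifying against prefect.settings for your version); on EKS we set these as environment variables on the server pods through the helm chart rather than in Python:

    # Illustration only: disable server loop services one at a time and observe load.
    # The setting names are our assumption of the Prefect 2.x names; double-check
    # them in prefect.settings. On Kubernetes these become pod environment variables.
    import os

    os.environ["PREFECT_API_SERVICES_CANCELLATION_CLEANUP_ENABLED"] = "false"
    # os.environ["PREFECT_API_SERVICES_LATE_RUNS_ENABLED"] = "false"
    # os.environ["PREFECT_API_SERVICES_PAUSE_EXPIRATIONS_ENABLED"] = "false"
    # os.environ["PREFECT_API_SERVICES_FLOW_RUN_NOTIFICATIONS_ENABLED"] = "false"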

Specifically, what appears to happen is that the clean_up_cancelled_subflow_runs function iterates over ALL subflows in the database (in all states, including completed ones) and then runs _cancel_subflow for each of them. The initial query is also quite heavy, since it preloads the corresponding flow_run_state, etc.

My guess is that this query is not doing what we expect it to do: maybe `db.FlowRun.id > high_water_mark` needs to be moved out of the or_() into the AND part of the filter? (See the sketch below the snippet.)

https://github.com/PrefectHQ/prefect/blob/2.x/src/prefect/server/services/cancellation_cleanup.py#L79-L92

    # Excerpt from clean_up_cancelled_subflow_runs (see permalink above).
    sa.select(db.FlowRun)
    .where(
        or_(
            db.FlowRun.state_type == states.StateType.PENDING,
            db.FlowRun.state_type == states.StateType.SCHEDULED,
            db.FlowRun.state_type == states.StateType.RUNNING,
            db.FlowRun.state_type == states.StateType.PAUSED,
            db.FlowRun.state_type == states.StateType.CANCELLING,
            # Inside the or_(), this matches every run past the high-water mark
            # regardless of its state, including completed ones.
            db.FlowRun.id > high_water_mark,
        ),
        db.FlowRun.parent_task_run_id.is_not(None),
    )
    .order_by(db.FlowRun.id)
    .limit(self.batch_size)
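For illustration, here is a minimal sketch of the restructuring I have in mind: keep the high-water-mark condition as its own AND clause so the state filter always applies. This is just my guess, not necessarily the fix the maintainers shipped, and NON_TERMINAL_STATES is a stand-in name for brevity:

    # Sketch only: the high-water-mark scan and the state filter are ANDed,
    # so completed subflow runs are never selected.
    NON_TERMINAL_STATES = [
        states.StateType.PENDING,
        states.StateType.SCHEDULED,
        states.StateType.RUNNING,
        states.StateType.PAUSED,
        states.StateType.CANCELLING,
    ]

    query = (
        sa.select(db.FlowRun)
        .where(
            db.FlowRun.state_type.in_(NON_TERMINAL_STATES),
            db.FlowRun.id > high_water_mark,
            db.FlowRun.parent_task_run_id.is_not(None),
        )
        .order_by(db.FlowRun.id)
        .limit(self.batch_size)
    )

With that change each batch would only touch runs that could actually still need cancellation cleanup, instead of re-reading all ~50k completed subflows on every pass.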

Version info (prefect version output)

We're using the Helm chart https://prefecthq.github.io/prefect-helm, version 2024.6.28162841, in our AWS EKS Kubernetes cluster.
Database: AWS RDS Postgres.
Image: prefecthq/prefect:2.19.7-python3.10

Here's `prefect version` from the client:
    Version:             2.19.7
    API version:         0.8.4
    Python version:      3.11.7
    Git commit:          60f05122
    Built:               Fri, Jun 28, 2024 11:27 AM
    OS/Arch:             linux/x86_64
    Profile:             default
    Server type:         ephemeral
    Server:
      Database:          sqlite
      SQLite version:    3.40.1


cicdw commented 2 months ago

@ashtuchkin I'll comment here once this is officially released in 2.20.7 later this week; if you're interested in testing this prior to release, you can install off our 2.x branch via `pip install -U git+https://github.com/PrefectHQ/prefect.git@2.x` once #15289 is merged!

jashwanth9 commented 2 months ago

@cicdw Thanks for fixing this issue. When can we expect the official release of 2.20.7?

cicdw commented 2 months ago

I just cut the release moments ago @jashwanth9 ! It should go live on PyPI imminently.

ashtuchkin commented 1 month ago

Thank you @cicdw !