mmutso-boku opened 3 months ago
Adding more images from different time periods.
This is for the past week: the "spikes" are from the webserver pod (orange), and before that, a different shade of purple.
Past 24h:
I'm seeing this behavior with dagster version 1.8.4 as well.
I have some information that may be related to this. Not only do we see the high CPU behavior in an AWS RDS Postgres database, but we also see jobs that sometimes fail to run when triggered via the webserver and GraphQL endpoints due to timeouts. These jobs do seem to run fine when triggered via normal schedule execution.
Here is an excerpt of the relevant part of the error message I see in the dagster webserver during job launch failures.
...
File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/event_log/sql_event_log.py", line 2783, in store_asset_check_event
self._store_asset_check_evaluation_planned(event, event_id)
File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/event_log/sql_event_log.py", line 2796, in _store_asset_check_evaluation_planned
with self.index_connection() as conn:
File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/site-packages/dagster_postgres/utils.py", line 165, in create_pg_connection
conn = retry_pg_connection_fn(engine.connect)
File "/usr/local/lib/python3.10/site-packages/dagster_postgres/utils.py", line 129, in retry_pg_connection_fn
raise DagsterPostgresException("too many retries for DB connection") from exc
The above exception was caused by the following exception:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 1 overflow 10 reached, connection timed out, timeout 30.00
Looking at the dagster webserver's source, the QueuePool size is limited to 1 for the engine used to interact with events here. This is not configurable. The QueuePool "overflow" is not set by the dagster webserver, but it defaults to 10 as specified in SQLAlchemy here. The pool timeout defaults to 30 seconds as specified in SQLAlchemy here.

My understanding of how the QueuePool works is that "size" is the desired number of connections; these are kept open whether or not they are in use. The "overflow" is additional connections above and beyond "size", and overflow connections are closed once they are no longer being used. If more than "size + overflow" connections are requested, the request for the "size + overflow + 1" connection blocks for the duration of the pool timeout. In this case, it seems there were 11 connections in use continuously for 30 seconds, which triggered the error that produced the stack trace above.
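To make the pool arithmetic concrete, here is a minimal sketch (not Dagster's actual code) of how SQLAlchemy's QueuePool behaves with these limits. The connection string is a placeholder; only pool_size=1 plus the defaults max_overflow=10 and pool_timeout=30 are taken from the error message above.

import sqlalchemy
from sqlalchemy.exc import TimeoutError as PoolTimeoutError

engine = sqlalchemy.create_engine(
    "postgresql://user:password@db-host:5432/dagster",  # placeholder DSN
    pool_size=1,      # one persistent connection, kept open even when idle
    max_overflow=10,  # up to 10 extra connections, closed when checked back in
    pool_timeout=30,  # seconds a checkout waits for a free slot before failing
)

# Check out size + overflow = 11 connections and hold all of them.
held = [engine.connect() for _ in range(11)]

try:
    # The 12th checkout blocks for pool_timeout seconds, then raises
    # sqlalchemy.exc.TimeoutError like the one in the stack trace above.
    engine.connect()
except PoolTimeoutError as exc:
    print(exc)  # "QueuePool limit of size 1 overflow 10 reached, ..."
finally:
    for conn in held:
        conn.close()  # returning connections frees the pool again

With only one persistent connection and ten overflow slots, eleven long-lived sessions are enough to starve this engine for the full 30-second timeout.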
I'm not familiar with the dagster webserver's design goals, but I wonder if it has some kind of connection pool "leak", if you will, such that additional connections are made by the dagster webserver up to the default overflow maximum of 10. These leaked connections remain in a loop running simple but CPU-intensive queries, which drives up CPU use on the database. They also consume connection pool slots in the dagster webserver, which can prevent large jobs from being submitted.
I would love to hear from somebody familiar with the dagster webserver design goals whether it's expected that the webserver would consume more than 1 connection from the QueuePool during normal operation.
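In case it helps narrow this down, here is a hedged diagnostic sketch (not part of Dagster) that counts open connections per client on the Postgres instance. The DSN is a placeholder for your environment, and psycopg2 is assumed to be available since it is the driver dagster-postgres uses. A webserver pod holding many sessions at once would support the leak theory above.

import psycopg2

# Placeholder DSN pointing at the Dagster database.
conn = psycopg2.connect("postgresql://user:password@db-host:5432/dagster")
try:
    with conn.cursor() as cur:
        # Group open sessions by application/client/state so the webserver
        # pod stands out if it holds more connections than its pool size.
        cur.execute(
            """
            SELECT application_name, client_addr, state, count(*)
            FROM pg_stat_activity
            WHERE datname = current_database()
            GROUP BY application_name, client_addr, state
            ORDER BY count(*) DESC
            """
        )
        for application_name, client_addr, state, n in cur.fetchall():
            print(application_name, client_addr, state, n)
finally:
    conn.close()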
Dagster version
1.7.13
What's the issue?
Around 3 days ago I noticed that the dagster DB (Postgres RDS) CPU usage is near 100% for sustained periods, whereas normally this has not been the case. I eventually traced the queries to the webserver k8s pod.
I also do not think this is due to increased number of users using the dagster UI.
Orange is queries from the webserver pod, purple is the daemon pod. As can be seen, the timing and length of these periods are quite random, and during these periods the database CPU usage is near 100%.
As for which queries are being "spammed", here are the top 3 from the list:
I am currently at a loss what could cause this, but perhaps you have ideas?
What did you expect to happen?
This has not happened before during normal operation, and it seems odd that the webserver inflicts such a heavy load on the dagster DB with these seemingly strange queries.
How to reproduce?
I currently do not know how to reproduce this.
Deployment type
Dagster Helm chart
Deployment details
Dagster Helm chart deployment - separate pods for the daemon, webserver, read-only webserver, code locations, and runs
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.