dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io

Webserver, at random times and for random periods, makes queries to the Dagster DB that drive CPU usage to near maximum #23500

Open mmutso-boku opened 3 months ago

mmutso-boku commented 3 months ago

Dagster version

1.7.13

What's the issue?

Around 3 days ago I noticed that the Dagster DB (Postgres RDS) CPU usage is near 100% for sustained periods, which has not been the case during normal operation. I eventually traced the queries to the webserver k8s pod.

I also do not think this is due to an increased number of users using the Dagster UI.

[Chart: query volume by source pod] Orange is queries from the webserver pod, purple is the daemon pod. As can be seen, the timing and length of these periods are quite random, and during these periods the database CPU usage is near 100%.

As for which queries are being "spammed" [screenshot of query statistics], the top 3 from the list are:

SELECT pg_catalog.pg_attribute.attname AS name, pg_catalog.format_type(pg_catalog.pg_attribute.atttypid, pg_catalog.pg_attribute.atttypmod) AS format_type, (SELECT pg_catalog.pg_get_expr(pg_catalog.pg_attrdef.adbin, pg_catalog.pg_attrdef.adrelid) AS pg_get_expr_1 
FROM pg_catalog.pg_attrdef 
WHERE pg_catalog.pg_attrdef.adrelid = pg_catalog.pg_attribute.attrelid AND pg_catalog.pg_attrdef.adnum = pg_catalog.pg_attribute.attnum AND pg_catalog.pg_attribute.atthasdef) AS "default", pg_catalog.pg_attribute.attnotnull AS not_null, pg_catalog.pg_class.relname AS table_name, pg_catalog.pg_description.description AS comment, pg_catalog.pg_attribute.attgenerated AS generated, (SELECT json_build_object('always', pg_catalog.pg_attribute.attidentity = 'a', 'start', pg_catalog.pg_sequence.seqstart, 'increment', pg_catalog.pg_sequence.seqincrement, 'minvalue', pg_catalog.pg_sequence.seqmin, 'maxvalue', pg_catalog.pg_sequence.seqmax, 'cache', pg_catalog.pg_sequence.seqcache, 'cycle', pg_catalog.pg_sequence.seqcycle) AS json_build_object_1 
FROM pg_catalog.pg_sequence 
WHERE pg_catalog.pg_attribute.attidentity != '' AND pg_catalog.pg_sequence.seqrelid = CAST(CAST(pg_catalog.pg_get_serial_sequence(CAST(CAST(pg_catalog.pg_attribute.attrelid AS REGCLASS) AS TEXT), pg_catalog.pg_attribute.attname) AS REGCLASS) AS OID)) AS identity_options 
FROM pg_catalog.pg_class LEFT OUTER JOIN pg_catalog.pg_attribute ON pg_catalog.pg_class.oid = pg_catalog.pg_attribute.attrelid AND pg_catalog.pg_attribute.attnum > 0 AND NOT pg_catalog.pg_attribute.attisdropped LEFT OUTER JOIN pg_catalog.pg_description ON pg_catalog.pg_description.objoid = pg_catalog.pg_attribute.attrelid AND pg_catalog.pg_description.objsubid = pg_catalog.pg_attribute.attnum JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace 
WHERE pg_catalog.pg_class.relkind = ANY (ARRAY['r', 'p', 'f', 'v', 'm']) AND pg_catalog.pg_table_is_visible(pg_catalog.pg_class.oid) AND pg_catalog.pg_namespace.nspname != 'pg_catalog' AND pg_catalog.pg_class.relname IN ('asset_keys') ORDER BY pg_catalog.pg_class.relname, pg_catalog.pg_attribute.attnum
SELECT pg_catalog.pg_type.typname AS name, pg_catalog.format_type(pg_catalog.pg_type.typbasetype, pg_catalog.pg_type.typtypmod) AS attype, NOT pg_catalog.pg_type.typnotnull AS nullable, pg_catalog.pg_type.typdefault AS "default", pg_catalog.pg_type_is_visible(pg_catalog.pg_type.oid) AS visible, pg_catalog.pg_namespace.nspname AS schema, domain_constraints.condefs, domain_constraints.connames, pg_catalog.pg_collation.collname 
FROM pg_catalog.pg_type JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_type.typnamespace LEFT OUTER JOIN pg_catalog.pg_collation ON pg_catalog.pg_type.typcollation = pg_catalog.pg_collation.oid LEFT OUTER JOIN (SELECT pg_catalog.pg_constraint.contypid AS contypid, array_agg(pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid, true)) AS condefs, array_agg(CAST(pg_catalog.pg_constraint.conname AS TEXT)) AS connames 
FROM pg_catalog.pg_constraint 
WHERE pg_catalog.pg_constraint.contypid != 0 GROUP BY pg_catalog.pg_constraint.contypid) AS domain_constraints ON pg_catalog.pg_type.oid = domain_constraints.contypid 
WHERE pg_catalog.pg_type.typtype = 'd' ORDER BY pg_catalog.pg_namespace.nspname, pg_catalog.pg_type.typname
SELECT pg_catalog.pg_type.typname AS name, pg_catalog.pg_type_is_visible(pg_catalog.pg_type.oid) AS visible, pg_catalog.pg_namespace.nspname AS schema, lbl_agg.labels AS labels 
FROM pg_catalog.pg_type JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_type.typnamespace LEFT OUTER JOIN (SELECT pg_catalog.pg_enum.enumtypid AS enumtypid, array_agg(CAST(pg_catalog.pg_enum.enumlabel AS TEXT) ORDER BY pg_catalog.pg_enum.enumsortorder) AS labels 
FROM pg_catalog.pg_enum GROUP BY pg_catalog.pg_enum.enumtypid) AS lbl_agg ON pg_catalog.pg_type.oid = lbl_agg.enumtypid 
WHERE pg_catalog.pg_type.typtype = 'e' ORDER BY pg_catalog.pg_namespace.nspname, pg_catalog.pg_type.typname
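For reference, these look like the pg_catalog introspection statements that SQLAlchemy's Postgres dialect emits when reflecting a table (its columns, plus the domain and enum types needed to resolve them). A minimal sketch that produces similar statements, assuming direct access to the Dagster database (the connection URL below is a placeholder):

    # Sketch only: reproduce the kind of pg_catalog introspection shown above.
    # The connection URL is a placeholder; echo=True prints the emitted SQL.
    from sqlalchemy import create_engine, inspect

    engine = create_engine(
        "postgresql+psycopg2://user:pass@dagster-rds:5432/dagster", echo=True
    )

    # Reflecting the 'asset_keys' table queries pg_catalog.pg_attribute for its
    # columns and pg_catalog.pg_type for domains/enums -- statements very close
    # to the three captured above.
    inspector = inspect(engine)
    for col in inspector.get_columns("asset_keys"):
        print(col["name"], col["type"])

That by itself does not explain why such reflection would run repeatedly from the webserver, only what kind of code path appears to be generating the statements.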

I am currently at a loss as to what could cause this, but perhaps you have ideas?

What did you expect to happen?

This has not happened before during normal operation, and it seems strange that the webserver inflicts such a heavy load on the Dagster DB with these odd-looking queries.

How to reproduce?

I currently do not know how to reproduce this.

Deployment type

Dagster Helm chart

Deployment details

Dagster Helm chart deployment, with separate pods for the daemon, webserver, read-only webserver, code locations, and runs.

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

mmutso-boku commented 3 months ago

Adding more images of different time periods.

This is for the past week: [chart] The "spikes" are from the webserver pod: orange and, before that, a different shade of purple.

Past 24h: [chart]

kirkhansen commented 1 month ago

I'm seeing this behavior with dagster version 1.8.4 as well.

rileymcdowell commented 1 month ago

I have some information that may be related to this. Not only do we see the high-CPU behavior in an AWS RDS Postgres database, but we also see jobs that sometimes fail to run, due to timeouts, when triggered via the webserver and GraphQL endpoints. These jobs do seem to run fine when triggered via normal schedule execution.

Here is an excerpt of the relevant part of the error message I see in dagster webserver during job launch failures.

  ...
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/event_log/sql_event_log.py", line 2783, in store_asset_check_event
    self._store_asset_check_evaluation_planned(event, event_id)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/event_log/sql_event_log.py", line 2796, in _store_asset_check_evaluation_planned
    with self.index_connection() as conn:
  File "/usr/local/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/site-packages/dagster_postgres/utils.py", line 165, in create_pg_connection
    conn = retry_pg_connection_fn(engine.connect)
  File "/usr/local/lib/python3.10/site-packages/dagster_postgres/utils.py", line 129, in retry_pg_connection_fn
    raise DagsterPostgresException("too many retries for DB connection") from exc

The above exception was caused by the following exception:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 1 overflow 10 reached, connection timed out, timeout 30.00 

Looking at dagster webserver's source, the QueuePool size is limited to 1 for the engine used to interact with events here. This is not configurable. The QueuePool "overflow" is not set by dagster webserver, but it defaults to 10 as specified in SQLAlchemy here. The pool timeout defaults to 30 seconds, also as specified in SQLAlchemy here.

I believe the way the QueuePool works is that "size" is the desired number of connections; these are kept open whether they are in use or not. The "overflow" is additional connections above and beyond "size", and overflow connections are closed once they are no longer being used. If more than "size + overflow" connections are requested, the next checkout blocks for the duration of the pool timeout. In this case, it seems there are 11 connections in use continuously for 30 seconds, which triggers the error that produced the stacktrace above.
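To illustrate that arithmetic, here is a standalone sketch (not dagster's code; the SQLite file is just a stand-in for the Postgres event-log engine, and the timeout is shortened so the demo fails quickly). It holds pool_size + max_overflow connections open and then hits the same TimeoutError on the next checkout:

    # Standalone sketch of the QueuePool arithmetic described above.
    import sqlalchemy as sa
    from sqlalchemy.exc import TimeoutError as SATimeoutError
    from sqlalchemy.pool import QueuePool

    engine = sa.create_engine(
        "sqlite:///pool_demo.db",   # stand-in for the Postgres event-log engine
        poolclass=QueuePool,
        pool_size=1,                # the webserver pins this to 1
        max_overflow=10,            # SQLAlchemy's default overflow
        pool_timeout=3,             # shortened from the 30s default for the demo
    )

    # Hold size + overflow = 11 connections open, as if they had leaked.
    held = [engine.connect() for _ in range(11)]

    try:
        # The 12th checkout blocks for pool_timeout seconds, then raises
        # "QueuePool limit of size 1 overflow 10 reached, connection timed out, ..."
        engine.connect()
    except SATimeoutError as exc:
        print(exc)
    finally:
        for conn in held:
            conn.close()

With the default 30-second timeout and 11 connections held continuously, this is exactly the error in the traceback above.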

I'm not familiar with dagster webserver's design goals, but I wonder if it has some kind of connection-pool "leak", if you will, such that additional connections are opened by dagster webserver up to the default overflow maximum of 10. These leaked connections remain in a loop running simple but CPU-intensive queries, which drives up CPU usage on the database. They also consume connection pool slots in dagster webserver, which can prevent large jobs from being submitted. One way to check this from the database side is sketched below.
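A diagnostic sketch along those lines, assuming direct read access to the RDS instance (the connection URL is a placeholder, and this is not dagster functionality): list the open connections for the Dagster database, how old they are, and what they are currently running, then match the client addresses against the webserver pod.

    # Diagnostic sketch, not part of dagster: inspect pg_stat_activity to see
    # which connections are long-lived and what they are executing.
    import sqlalchemy as sa

    engine = sa.create_engine("postgresql+psycopg2://user:pass@dagster-rds:5432/dagster")

    query = sa.text("""
        SELECT pid,
               client_addr,
               state,
               now() - backend_start AS connection_age,
               now() - query_start   AS query_age,
               left(query, 80)       AS current_query
        FROM pg_stat_activity
        WHERE datname = current_database()
        ORDER BY backend_start
    """)

    with engine.connect() as conn:
        for row in conn.execute(query):
            print(dict(row._mapping))

If the leak hypothesis is right, this should show a set of old connections from the webserver pod's address whose current query is one of the pg_catalog statements above.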

I would love to hear from somebody familiar with the dagster webserver design goals whether it's expected that the webserver would consume more than 1 connection from the QueuePool during normal operation.