dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
10.99k stars 1.38k forks source link

Schedules and Runs pages are unloadable in UI after upgrading to 1.7.11 #22935

Open brien-gleason opened 1 month ago

brien-gleason commented 1 month ago

Dagster version

1.7.11

What's the issue?

When I navigate to the Runs view or Schedules view of a specific job, I get an error and see GraphQL Error - see console for details.

Both pages return this error: Message: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout. The full stacktraces are in "Additional Information" section.

I don't see this error on previous versions and we did not change anything on our Postgres DB. Rolling back to version 1.7.10 fixes these two pages. The queries both complete successfully when running directly on the DB.

How to reproduce?

This happens for me every time I navigate to the Runs or Schedules views on version 1.7.11 or 1.7.12.

Deployment type

Dagster Helm chart

Deployment details

Kubernetes version is 1.28.5 and Postgres DB is an Azure Flex DB running on version 14.11

Additional information

The full Stacktrace for the Schedules page is:

Operation name: ScheduleRootQuery

Message: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout

[SQL: SELECT runs.id, runs.run_body, runs.status, runs.create_timestamp, runs.update_timestamp, runs.start_time, runs.end_time 
FROM runs 
WHERE runs.run_id IN (SELECT run_tags.run_id 
FROM run_tags 
WHERE run_tags.key = %(key_1)s AND run_tags.value = %(value_1)s OR run_tags.key = %(key_2)s AND run_tags.value = %(value_2)s GROUP BY run_tags.run_id 
HAVING count(DISTINCT run_tags.key) = %(count_1)s) ORDER BY runs.id DESC 
 LIMIT %(param_1)s]
[parameters: {'key_1': 'dagster/schedule_name', 'value_1': 'driver_pay_schedule', 'key_2': '.dagster/repository', 'value_2': '__repository__@postgres-replication', 'count_1': 2, 'param_1': 1}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

Path: ["scheduleOrError","scheduleState","runs"]

Locations: [{"line":84,"column":3}]

Stack Trace:
  File "/usr/local/lib/python3.10/site-packages/graphql/execution/execute.py", line 521, in execute_field
    result = resolve_fn(source, info, **args)
  File "/usr/local/lib/python3.10/site-packages/dagster_graphql/schema/instigation.py", line 656, in resolve_runs
    for record in graphene_info.context.instance.get_run_records(
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 694, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1843, in get_run_records
    return self._run_storage.get_run_records(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 421, in get_run_records
    rows = self.fetchall(query)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 118, in fetchall
    return db_fetch_mappings(conn, query)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/sqlalchemy_compat.py", line 35, in db_fetch_mappings
    return conn.execute(query).mappings().all()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2353, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)

The above exception was the direct cause of the following exception:

Message: psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

Stack Trace:
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)

The full stacktrace for Runs page is:

Operation name: JobMetadataQuery

Message: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout

[SQL: SELECT runs.id, runs.run_body, runs.status, runs.create_timestamp, runs.update_timestamp, runs.start_time, runs.end_time 
FROM runs 
WHERE runs.pipeline_name = %(pipeline_name_1)s AND (EXISTS (SELECT * 
FROM run_tags 
WHERE runs.run_id = run_tags.run_id AND run_tags.key = %(key_1)s AND run_tags.value = %(value_1)s)) ORDER BY runs.id DESC 
 LIMIT %(param_1)s]
[parameters: {'pipeline_name_1': 'driver_pay_job', 'key_1': '.dagster/repository', 'value_1': '__repository__@postgres-replication', 'param_1': 5}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

Path: ["pipelineRunsOrError","results"]

Locations: [{"line":17,"column":7}]

Stack Trace:
  File "/usr/local/lib/python3.10/site-packages/graphql/execution/execute.py", line 521, in execute_field
    result = resolve_fn(source, info, **args)
  File "/usr/local/lib/python3.10/site-packages/dagster_graphql/schema/runs.py", line 116, in resolve_results
    return get_runs(graphene_info, self._filters, self._cursor, self._limit)
  File "/usr/local/lib/python3.10/site-packages/dagster_graphql/implementation/fetch_runs.py", line 128, in get_runs
    for record in instance.get_run_records(filters=filters, cursor=cursor, limit=limit)
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 694, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/instance/__init__.py", line 1843, in get_run_records
    return self._run_storage.get_run_records(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 421, in get_run_records
    rows = self.fetchall(query)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 118, in fetchall
    return db_fetch_mappings(conn, query)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/storage/sqlalchemy_compat.py", line 35, in db_fetch_mappings
    return conn.execute(query).mappings().all()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2353, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)

The above exception was the direct cause of the following exception:

Message: psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

Stack Trace:
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

paralyzah commented 1 month ago

I'm almost sure the main reason of this behavior terribly poor performance of filtering runs by tags. Your repro above confirms it. https://github.com/dagster-io/dagster/blob/1.7.13/python_modules/dagster/dagster/_core/storage/runs/sql_run_storage.py#L344-L360

This is generic approach, for sure, but with current database design performance for this particular query will slow down as the number of runs/runs tags increases. I guess you faced this because your database accumulated a lot of runs and runs tags as well. We got the same behavior on 1.6.8 with MySQL for schedules/sensors pages. Currently we have around 100k runs and 700k runs tags.

FYI: relational databases always have an issues with multiple filtering conditions combined by OR clause.