apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

sql_lab.get_sql_results encountering psycopg2.errors.SerializationFailure #30064

Closed mapledan closed 4 days ago

mapledan commented 2 weeks ago

Bug description

In version 4.0.2, my sql_lab async tasks were functioning correctly with Celery. However, after upgrading to version 4.1.0rc2, I started encountering errors when executing sql_lab.get_sql_results. The tasks are now failing with a psycopg2.errors.SerializationFailure.

My database is PostgreSQL 15. Here is my current superset_config:

GLOBAL_ASYNC_QUERIES_REDIS_CONFIG = {
    "port": f"{REDIS_PORT}",
    "host": f"{REDIS_HOST}",
    "db": 0,
    "ssl": False,
}
GLOBAL_ASYNC_QUERIES_REDIS_STREAM_PREFIX = "async-events-"
GLOBAL_ASYNC_QUERIES_REDIS_STREAM_LIMIT = 1000
GLOBAL_ASYNC_QUERIES_REDIS_STREAM_LIMIT_FIREHOSE = 1000000
GLOBAL_ASYNC_QUERIES_JWT_COOKIE_NAME = "async-token"
GLOBAL_ASYNC_QUERIES_JWT_COOKIE_SECURE = True
GLOBAL_ASYNC_QUERIES_JWT_COOKIE_SAMESITE = None
GLOBAL_ASYNC_QUERIES_JWT_COOKIE_DOMAIN = None
GLOBAL_ASYNC_QUERIES_JWT_SECRET = "..."
GLOBAL_ASYNC_QUERIES_TRANSPORT = "ws"
GLOBAL_ASYNC_QUERIES_WEBSOCKET_URL = "wss://.../ws"

RESULTS_BACKEND = RedisCache(
    host=REDIS_HOST, port=REDIS_PORT, db=REDIS_RESULTS_DB, key_prefix='superset_result_')

class CeleryConfig(object):
    broker_url = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CELERY_DB}"
    imports = ("superset.sql_lab", "superset.tasks.scheduler")
    result_backend = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_RESULTS_DB}"
    log_level = "DEBUG"
    worker_concurrency = 64
    worker_prefetch_multiplier = 2
    task_acks_late = False
    task_time_limit = 600
    task_annotations = {
        "sql_lab.get_sql_results": {
            "rate_limit": "5/s",
        },
        "email_reports.send": {
            "rate_limit": "5/s",
            "time_limit": 180,
            "soft_time_limit": 240,
            "ignore_result": True,
        },
    }

CELERY_CONFIG = CeleryConfig

How to reproduce the bug

  1. Upgrade from version 4.0.2 to 4.1.0rc2.
  2. Set the async query setting.
  3. Run query in SQL Lab.

Screenshots/recordings

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/superset/sql_lab.py", line 182, in get_sql_results
    return execute_sql_statements(
  File "/app/superset/sql_lab.py", line 440, in execute_sql_statements
    db.session.commit()
  File "<string>", line 2, in commit
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1454, in commit
    self._transaction.commit(_to_root=self.future)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 832, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 811, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3449, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3588, in _flush
    with util.safe_reraise():
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3549, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 237, in save_obj
    _emit_update_statements(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 1001, in _emit_update_statements
    c = connection._execute_20(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1710, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1577, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1953, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.SerializationFailure) could not serialize access due to concurrent update

[SQL: UPDATE query SET status=%(status)s, start_running_time=%(start_running_time)s, changed_on=%(changed_on)s WHERE query.id = %(query_id)s]
[parameters: {'status': <QueryStatus.RUNNING: 'running'>, 'start_running_time': 1725005762062.906, 'changed_on': datetime.datetime(2024, 8, 30, 8, 16, 2, 66679), 'query_id': 77114}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/superset/sql_lab.py", line 157, in get_query
    return db.session.query(Query).filter_by(id=query_id).one()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2870, in one
    return self._iter().one()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916, in _iter
    result = self.session.execute(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1716, in execute
    conn = self._connection_for_bind(bind)
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1555, in _connection_for_bind
    return self._transaction._connection_for_bind(
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 724, in _connection_for_bind
    self._assert_active()
  File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 604, in _assert_active
    raise sa_exc.PendingRollbackError(
sqlalchemy.exc.PendingRollbackError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.errors.SerializationFailure) could not serialize access due to concurrent update

[SQL: UPDATE query SET status=%(status)s, start_running_time=%(start_running_time)s, changed_on=%(changed_on)s WHERE query.id = %(query_id)s]
[parameters: {'status': <QueryStatus.RUNNING: 'running'>, 'start_running_time': 1725005762062.906, 'changed_on': datetime.datetime(2024, 8, 30, 8, 16, 2, 66679), 'query_id': 77114}]
(Background on this error at: https://sqlalche.me/e/14/e3q8) (Background on this error at: https://sqlalche.me/e/14/7s2a)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/app/superset/initialization/__init__.py", line 111, in __call__
    return task_base.__call__(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
  File "/app/superset/sql_lab.py", line 194, in get_sql_results
    query = get_query(query_id)
  File "/usr/local/lib/python3.10/site-packages/backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
  File "/app/superset/sql_lab.py", line 159, in get_query
    raise SqlLabException("Failed at getting query") from ex
superset.sql_lab.SqlLabException: Failed at getting query


Superset version

master / latest-dev

Python version

3.10

Node version

16

Browser

Chrome

Additional context

No response


dosubot[bot] commented 2 weeks ago

@mapledan Hello! I'm here to assist you with any bugs or questions you have. If you need any help, feel free to ask.

The psycopg2.errors.SerializationFailure error you're encountering after upgrading from version 4.0.2 to 4.1.0rc2 is likely due to changes in how SQL statements are executed and committed. Specifically, the execute_sql_statements function now includes a commit operation for certain types of queries:

# Commit the connection so CTA queries will create the table and any DML.
should_commit = (
    not db_engine_spec.is_select_query(parsed_query)  # check if query is DML
    or apply_ctas
)
if should_commit:
    conn.commit()

This commit operation can lead to serialization failures in PostgreSQL, especially in high-concurrency environments or when using certain isolation levels. The SerializationFailure error is typically a result of concurrent transactions interfering with each other, which is more likely to occur with the added commit operation [1].
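For illustration, here is a minimal two-session sketch of this failure mode, assuming direct psycopg2 access to the metadata database, a placeholder DSN, and an existing row with id = 1 in the query table:

import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

# Hypothetical DSN for the metadata database -- adjust to your environment.
dsn = "dbname=superset user=superset host=localhost"

a = psycopg2.connect(dsn)
b = psycopg2.connect(dsn)
a.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
b.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)

ca, cb = a.cursor(), b.cursor()
ca.execute("SELECT status FROM query WHERE id = 1")       # session A takes its snapshot here
cb.execute("UPDATE query SET status = 'running' WHERE id = 1")
b.commit()                                                 # session B commits a newer row version
try:
    # session A now updates a row that changed after its snapshot
    ca.execute("UPDATE query SET status = 'success' WHERE id = 1")
    a.commit()
except errors.SerializationFailure as exc:
    print(exc)  # could not serialize access due to concurrent update
    a.rollback()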

To mitigate this issue, you can set the isolation_level to READ COMMITTED in the SQLALCHEMY_ENGINE_OPTIONS for PostgreSQL databases. This setting can help reduce the likelihood of SerializationFailure errors by using a less strict isolation level:

if make_url(SQLALCHEMY_DATABASE_URI).get_backend_name() in ("postgresql", "mysql"):
    SQLALCHEMY_ENGINE_OPTIONS["isolation_level"] = "READ COMMITTED"  # noqa: F405

This configuration can be added to your Superset configuration file [2].

To continue talking to Dosu, mention @dosu.

mapledan commented 2 weeks ago

My PostgreSQL configuration default_transaction_isolation and transaction_isolation are already set to read committed.
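Note that the per-connection level set by SQLAlchemy can differ from the server default. A quick way to see what the metadata-DB session actually runs under (a sketch, assuming a PostgreSQL metadata database and execution from `superset shell`):

from sqlalchemy import text

from superset import db

# server default vs. the level negotiated for this session
print(db.session.execute(text("SHOW default_transaction_isolation")).scalar())
print(db.session.execute(text("SHOW transaction_isolation")).scalar())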

michael-s-molina commented 1 week ago

@mapledan I linked your issue here as instructed by our release process. If you find any other issues while testing 4.1.0 RC2, please add the link to it there so we can track. Thanks for testing the RC!

mistercrunch commented 1 week ago

Curious, does this error occur if you switch to read-uncommitted?

I don't know much about how GLOBAL_ASYNC_QUERIES is implemented, but I'd expect to see this kind of error in polling environments, and would tend to recommend read-uncommitted there, at least for scans on the query table. Picture hundreds of concurrent users in SQL Lab polling the query table every second: any update on that table becomes contended because there's always a scan in progress, as reads can overlap one another and updates get locked out.

read-uncommitted should be absolutely fine for an app like Superset, where most mutations are user-specific and non-concurrent. This is not a bank updating account balances. We have bigger issues around dashboard editing concurrency that are more app-level than database-level, meaning if two users open and edit a dashboard at the same time, the second to save will overwrite the first user's changes - that is an app-level issue, not a database isolation issue...

If people want to maintain a higher level of transaction integrity, I'd suggest lowering the isolation level selectively on the scans across the app, particularly around the query table.

Can anyone think of potential issues with using read-uncommitted? I think it should be the suggested / supported default.
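For anyone wanting to experiment with that, it is a one-line override in superset_config.py (a sketch; PostgreSQL treats READ UNCOMMITTED the same as READ COMMITTED, while MySQL implements true dirty reads):

# Sketch: lower the metadata-DB isolation level via the engine options.
SQLALCHEMY_ENGINE_OPTIONS = {
    "isolation_level": "READ UNCOMMITTED",
}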

michael-s-molina commented 1 week ago

Can anyone think of potential issues with using read-uncommitted? I think it should be the suggested / supported default.

@villebro Will the shared lock you implemented be affected by this change?

https://github.com/apache/superset/pull/28628 made the Postgres and MySQL isolation levels consistent, so if we decide to change this configuration, let's keep it consistent.

mistercrunch commented 1 week ago

shared lock

curious to learn more about that

https://github.com/apache/superset/pull/28628 made both Postgres and MySQL isolation level consistent

Personally I think read-uncommitted is probably a more suited default

villebro commented 1 week ago

@mistercrunch @michael-s-molina I think we're getting things mixed up here - AFAIK this error doesn't have anything to do with Global Async Queries (I believe that part of the config was unnecessarily pasted here). Rather, this regression appears to be related to how async execution is happening in SQL Lab (not in scope for the current GAQ implementation).

mistercrunch commented 1 week ago

Yeah, the error is triggered on the commit() call for the models.Query update. Seems polling could lead to more locks, potentially(?). Say the Query table is under constant scan (hundreds of users polling, the next scan starting before the previous one ends), then updates can be locked out, I guess.

mistercrunch commented 1 week ago

One idea that may help if you want to try it @mapledan

# Note: assumes this lives in superset/models/sql_lab.py, where datetime, Union,
# db, Query and get_user_id are already in scope; no extra sqlalchemy imports needed.

@staticmethod
def get_queries_changed_after(last_updated_ms: Union[float, int]) -> list[Query]:
    # UTC date time, same that is stored in the DB.
    last_updated_dt = datetime.utcfromtimestamp(last_updated_ms / 1000)

    # Check the database dialect and apply AUTOCOMMIT only for supported databases
    if db.session.bind.dialect.name in ["postgresql", "mysql"]:
        return (
            db.session.query(Query)
            .filter(Query.user_id == get_user_id(), Query.changed_on >= last_updated_dt)
            .execution_options(isolation_level="AUTOCOMMIT")  # Use AUTOCOMMIT for compatible databases
            .all()
        )
    else:
        return (
            db.session.query(Query)
            .filter(Query.user_id == get_user_id(), Query.changed_on >= last_updated_dt)
            .all()
        )

in this module -> superset/models/sql_lab.py. For the record, this was generated by GPT but LGTM - it could be more DRY though if we decided to open a PR with this idea.

Wanna try it and report back?

BTW I double-checked that we have a proper index on (user_id, changed_on) to make sure that the polling query is as fast as it can be.

mapledan commented 1 week ago

One idea that may help if you want to try it @mapledan […]

Thanks for providing the information and assistance! I tried adding execution_options(isolation_level="AUTOCOMMIT") in superset/daos/query.py#68, but the error still persists.

However, after adjusting the following settings in config.py, everything started working as expected:

SQLALCHEMY_ENGINE_OPTIONS = {
    "isolation_level": "READ COMMITTED",
}

It seems the issue might not be related to get_queries_changed_after.

michael-s-molina commented 1 week ago

@mistercrunch @mapledan I believe there's an error in https://github.com/apache/superset/pull/28628:

# The default MySQL isolation level is REPEATABLE READ whereas the default PostgreSQL
# isolation level is READ COMMITTED. All backends should use READ COMMITTED (or similar)
# to help ensure consistent behavior.
SQLALCHEMY_ENGINE_OPTIONS = {
    "isolation_level": "SERIALIZABLE",  # SQLite does not support READ COMMITTED.
}

If you read the comment, it seems the isolation level should be set to READ COMMITTED for both MySQL and Postgres, and to SERIALIZABLE for SQLite, which does not support READ COMMITTED.

In that same PR, there's a test with the following:

if make_url(SQLALCHEMY_DATABASE_URI).get_backend_name() in ("postgresql", "mysql"):
    SQLALCHEMY_ENGINE_OPTIONS["isolation_level"] = "READ COMMITTED"  # noqa: F405

@sadpandajoe added a comment about this but never got a reply.

In summary, I think the fix is to use READ COMMITTED for both MySQL and Postgres and SERIALIZABLE for SQLite.

mistercrunch commented 1 week ago

@michael-s-molina your SQLite comment makes sense, though now I'm confused. @mapledan reports that changing his Postgres to READ COMMITTED fixes the issue, even though we're pointing to a PR that seems to set the default to READ COMMITTED(?) Why was @mapledan's isolation_level not set by this default(?) Did you have another setting?

michael-s-molina commented 1 week ago

@michael-s-molina your SQLite comment makes sense, though now I'm confused. @mapledan reports that changing his Postgres to READ COMMITTED fixes the issue, even though we're pointing to a PR that seems to set the default to READ COMMITTED(?) Why was @mapledan's isolation_level not set by this default(?) Did you have another setting?

The PR that I linked is NOT setting Postgres to READ COMMITTED, it's setting it to SERIALIZABLE. I think the fix is to use READ COMMITTED for both MySQL and Postgres and SERIALIZABLE for SQLite.

villebro commented 1 week ago

I think it would be a better idea to provide sane defaults in the extensions of BaseEngineSpec that get applied to SQLALCHEMY_ENGINE_OPTIONS and can be overridden via superset_config.py, rather than having a global default in config.py that doesn't work globally.

michael-s-molina commented 1 week ago

I think it would be a better idea to provide sane defaults in the extensions of BaseEngineSpec that get applied to SQLALCHEMY_ENGINE_OPTIONS and can be overridden via superset_config.py, rather than having a global default in config.py that doesn't work globally.

@villebro That's what I meant with

I think the fix is to use READ COMMITTED for both MySQL and Postgres and SERIALIZABLE for SQLite.

A global configuration won't work because you may have multiple database connections with different transaction isolation levels.

mistercrunch commented 1 week ago

Oh right! I had assumed so given the title of the PR, but it clearly sets a default that doesn't match the PR title. Good catch.

mistercrunch commented 1 week ago

I think it would be a better idea to provide sane defaults in the extensions of BaseEngineSpec

Mmmh, I get the idea, but I'd say the db_engine_spec abstraction is there to serve analytics-type workloads. For an analytics workload on Postgres you'd want AUTOCOMMIT, as you typically want as little isolation overhead as possible. Very different from OLTP. Say you happen to use Postgres for both the metadata db and an analytics workload: you'd probably want READ COMMITTED for the metadata db and AUTOCOMMIT for analytics...

I'd open a PR with a dynamic default for config.py, only specifying READ COMMITTED for mysql and postgres, but realizing that at config.py parse time we may not know what the database connection looks like. Maybe using default logic in DB_CONNECTION_MUTATOR(?). Or should we just not specify SQLALCHEMY_ENGINE_OPTIONS in config.py?
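A sketch of what such a dynamic default could look like in config.py, reusing the make_url check from the test snippet quoted earlier (the non-Postgres/MySQL branch mirrors the current SERIALIZABLE default):

from sqlalchemy.engine.url import make_url

# Sketch: pick the metadata-DB isolation level from the backend of
# SQLALCHEMY_DATABASE_URI at config parse time.
_backend = make_url(SQLALCHEMY_DATABASE_URI).get_backend_name()
if _backend in ("postgresql", "mysql"):
    SQLALCHEMY_ENGINE_OPTIONS = {"isolation_level": "READ COMMITTED"}
else:
    # SQLite does not support READ COMMITTED
    SQLALCHEMY_ENGINE_OPTIONS = {"isolation_level": "SERIALIZABLE"}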

mistercrunch commented 1 week ago

Submitting this DRAFT PR for review -> https://github.com/apache/superset/pull/30174

villebro commented 1 week ago

I think it would be a better idea to provide sane defaults in the extensions of BaseEngineSpec

Mmmh, I get the idea, but I'd say the db_engine_spec abstraction is there to serve analytics-type workloads. For an analytics workload on Postgres you'd want AUTOCOMMIT, as you typically want as little isolation overhead as possible. Very different from OLTP. Say you happen to use Postgres for both the metadata db and an analytics workload: you'd probably want READ COMMITTED for the metadata db and AUTOCOMMIT for analytics...

I'd open a PR with a dynamic default for config.py, only specifying READ COMMITTED for mysql and postgres, but realizing that at config.py parse time we may not know what the database connection looks like. Maybe using default logic in DB_CONNECTION_MUTATOR(?). Or should we just not specify SQLALCHEMY_ENGINE_OPTIONS in config.py?

Yeah you're absolutely right, my bad 👍 However, I'm still not entirely buying into having this in config.py, but let me sleep on it and see if I'm wiser in the morning.

michael-s-molina commented 1 week ago

Mmmh, I get the idea, but I'd say the db_engine_spec abstraction is there to serve analytics-type workloads. For an analytics workload on Postgres you'd want AUTOCOMMIT, as you typically want as little isolation overhead as possible. Very different from OLTP. Say you happen to use Postgres for both the metadata db and an analytics workload: you'd probably want READ COMMITTED for the metadata db and AUTOCOMMIT for analytics...

Good point 👍🏼

michael-s-molina commented 1 week ago

@mistercrunch @villebro Do you think this configuration could belong in the database connection modal? Maybe in the Performance section?

[screenshot: database connection modal]


It could be a select with SQLAlchemy options, with the default value coming from the db engine spec. We could have a method called default_transaction_isolation_level that returns a value, or None for databases that don't support isolation levels. The option to change the value would only appear if a value other than None is received.

Database      Isolation Level
MySQL         REPEATABLE READ
PostgreSQL    READ COMMITTED
Oracle        READ COMMITTED
SQL Server    READ COMMITTED
SQLite        SERIALIZABLE
Trino         None
Druid         None
Hive          READ COMMITTED
Presto        READ COMMITTED

This would allow us to change the isolation level for both metadata and analytical databases.
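A rough sketch of what that hook could look like; the method name comes from the proposal above, while the class layout is illustrative rather than Superset's actual db_engine_specs API:

from typing import Optional


class BaseEngineSpec:  # illustrative stand-in, not the real base class
    @classmethod
    def default_transaction_isolation_level(cls) -> Optional[str]:
        # None means the database does not expose a configurable isolation level
        return None


class PostgresEngineSpec(BaseEngineSpec):
    @classmethod
    def default_transaction_isolation_level(cls) -> Optional[str]:
        return "READ COMMITTED"


class SqliteEngineSpec(BaseEngineSpec):
    @classmethod
    def default_transaction_isolation_level(cls) -> Optional[str]:
        return "SERIALIZABLE"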

mistercrunch commented 1 week ago

do you think this configuration could belong to the database connection modal?

I think this setting matters most for the metadata database, as it relates to OLTP-type workloads, and that db connection is not exposed in the UI. For the metadata db, we should just pick a solid default (READ COMMITTED) and stick to it for CI/tests/production (and to @john-bodley's point, keep it as consistent across engines as possible). Also, we should probably officially only support mysql and postgres; everything else is "use at your own risk". Wanna run it on Oracle or MSSQL? Good luck, as we don't have licenses and can't blow up our build matrix.

It matters much less for the analytics-type workload, especially for single-query sessions that are just SELECTs, with something like Airflow or dbt typically doing a single INSERT/UPDATE statement per "session". There's rarely session/rollback management in your typical analytics database workload. Maybe at times you'd have some stored-procedure-type workload with ROLLBACK logic on the ETL side.

villebro commented 1 week ago

I agree with @mistercrunch. IIUC the issue here is limited to the metastore connection, not analytical db connections (for which the modal is relevant). So I'm leaning towards having sane defaults that can be overridden if needed. I propose moving the discussion to Max's PR (I'll take a look now).

mistercrunch commented 1 week ago

link to PR for convenience -> https://github.com/apache/superset/pull/30174