mapledan closed this issue 2 months ago
@mapledan Hello! I'm here to assist you with any bugs or questions you have. If you need any help, feel free to ask.
The psycopg2.errors.SerializationFailure error you're encountering after upgrading from version 4.0.2 to 4.1.0rc2 is likely due to changes in how SQL statements are executed and committed. Specifically, the execute_sql_statements function now includes a commit operation for certain types of queries:
# Commit the connection so CTA queries will create the table and any DML.
should_commit = (
    not db_engine_spec.is_select_query(parsed_query)  # check if query is DML
    or apply_ctas
)
if should_commit:
    conn.commit()
This commit operation can lead to serialization failures in PostgreSQL, especially in high-concurrency environments or when using certain isolation levels. The SerializationFailure error is typically a result of concurrent transactions interfering with each other, which is more likely to occur with the added commit operation [1].
To mitigate this issue, you can set the isolation_level to READ COMMITTED in the SQLALCHEMY_ENGINE_OPTIONS for PostgreSQL databases. This setting can help reduce the likelihood of SerializationFailure errors by using a less strict isolation level:
if make_url(SQLALCHEMY_DATABASE_URI).get_backend_name() in ("postgresql", "mysql"):
    SQLALCHEMY_ENGINE_OPTIONS["isolation_level"] = "READ COMMITTED"  # noqa: F405
This configuration can be added to your Superset configuration file [2].
To continue talking to Dosu, mention @dosu.
My PostgreSQL configuration default_transaction_isolation and transaction_isolation are already set to read committed.
@mapledan I linked your issue here as instructed by our release process. If you find any other issues while testing 4.1.0 RC2, please add the link to it there so we can track. Thanks for testing the RC!
Curious, does this error occur if you switch to read-uncommitted?
I don't know much about how GLOBAL_ASYNC_QUERIES is implemented, but I'd expect to see this kind of error in polling environments, and would tend to recommend read-uncommitted there, at least for scans on the query table. Picturing hundreds of concurrent users in SQL Lab polling the query table every second, any update on that table becomes contentious: there's always a scan in progress, reads can overlap one another, and updates are locked out.
read-uncommitted for an app like Superset, where most mutations are user-specific and non-concurrent, should be absolutely fine. This is not a bank updating account balances. We have bigger issues around dashboard editing concurrency that are more app-level than database-level, meaning if two users open and edit a dashboard at the same time, the second to save will overwrite the first user's changes - this is an app-level issue, not a database isolation issue...
If people want to maintain a higher level of transaction integrity, I'd suggest lowering the transaction level selectively on the scans across the app, particularly around the query table.
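For illustration, here's a minimal sketch of what selectively relaxing isolation on a scan could look like, using SQLAlchemy's per-connection execution options. The URI, table and column names are placeholders, not Superset's actual code (and note PostgreSQL itself treats READ UNCOMMITTED the same as READ COMMITTED):

```python
from sqlalchemy import create_engine, text

# Hypothetical metadata-db URI; relax isolation only for the read-only polling scan.
engine = create_engine("postgresql+psycopg2://superset:secret@metadata-db/superset")

with engine.connect().execution_options(isolation_level="READ UNCOMMITTED") as conn:
    # Read-only scan of the query table with relaxed isolation; writes keep the stricter default.
    rows = conn.execute(
        text("SELECT id, status FROM query WHERE user_id = :uid"),
        {"uid": 42},
    ).fetchall()
```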
Can anyone think of potential issues using read-uncommitted? I think it should be the suggested/supported default.
> Can anyone think of potential issues using read-uncommitted? I think it should be the suggested/supported default.
@villebro Will the shared lock you implemented be affected by this change?
https://github.com/apache/superset/pull/28628 made the Postgres and MySQL isolation levels consistent, so if we decide to change this configuration, let's keep it consistent.
> shared lock
curious to learn more about that
> https://github.com/apache/superset/pull/28628 made both Postgres and MySQL isolation level consistent
Personally I think read-uncommitted is probably a more suited default.
@mistercrunch @michael-s-molina I think we're getting things mixed up here - AFAIK this error doesn't have anything to do with Global Async Queries (I believe that part of the config was unnecessarily pasted here). Therefore, this regression appears to be related to how async execution is happening in SQL Lab (not in scope for current GAQ implementation).
Yeah, the error is triggered on the commit() call when the models.Query row is created. Seems polling could lead to more locks potentially (?) Say if the Query model is under constant scan (hundreds of users polling, the next scan starts before the previous one ends), then updates can be locked out I guess.
One idea that may help if you want to try it @mapledan
from datetime import datetime
from typing import Union

from superset import db
from superset.utils.core import get_user_id

# Note: Query is defined in this module (superset/models/sql_lab.py), so no import is needed.

@staticmethod
def get_queries_changed_after(last_updated_ms: Union[float, int]) -> list[Query]:
    # UTC date time, same that is stored in the DB.
    last_updated_dt = datetime.utcfromtimestamp(last_updated_ms / 1000)
    # Check the database dialect and apply AUTOCOMMIT only for supported databases.
    if db.session.bind.dialect.name in ("postgresql", "mysql"):
        return (
            db.session.query(Query)
            .filter(Query.user_id == get_user_id(), Query.changed_on >= last_updated_dt)
            .execution_options(isolation_level="AUTOCOMMIT")  # AUTOCOMMIT for compatible databases
            .all()
        )
    return (
        db.session.query(Query)
        .filter(Query.user_id == get_user_id(), Query.changed_on >= last_updated_dt)
        .all()
    )
in this module -> superset/models/sql_lab.py. For the record this was generated by GPT but LGTM - it could be more DRY though if we decided to open a PR with this idea.
Wanna try it and report back?
BTW I double-checked that we have a proper index on (user_id, changed_on) to make sure that polling query is as fast as it can be.
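For what it's worth, a possibly DRY-er variant of the same idea, under the same assumptions (an untested sketch, not an actual Superset patch):

```python
@staticmethod
def get_queries_changed_after(last_updated_ms: Union[float, int]) -> list[Query]:
    # UTC date time, same that is stored in the DB.
    last_updated_dt = datetime.utcfromtimestamp(last_updated_ms / 1000)
    query = db.session.query(Query).filter(
        Query.user_id == get_user_id(), Query.changed_on >= last_updated_dt
    )
    if db.session.bind.dialect.name in ("postgresql", "mysql"):
        # Only relax isolation on backends where the dialect supports AUTOCOMMIT.
        query = query.execution_options(isolation_level="AUTOCOMMIT")
    return query.all()
```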
Thanks for providing the information and assistance!
I tried adding execution_options(isolation_level="AUTOCOMMIT") in superset/daos/query.py#68, but the error still persists.
However, after adjusting the following settings in config.py, everything started working as expected:
SQLALCHEMY_ENGINE_OPTIONS = {
"isolation_level": "READ COMMITTED",
}
It seems the issue might not be related to get_queries_changed_after.
@mistercrunch @mapledan I believe there's an error on https://github.com/apache/superset/pull/28628:
# The default MySQL isolation level is REPEATABLE READ whereas the default PostgreSQL
# isolation level is READ COMMITTED. All backends should use READ COMMITTED (or similar)
# to help ensure consistent behavior.
SQLALCHEMY_ENGINE_OPTIONS = {
    "isolation_level": "SERIALIZABLE",  # SQLite does not support READ COMMITTED.
}
If you read the comment, it seems the isolation level should be set to READ COMMITTED for both MySQL and Postgres, and to SERIALIZABLE for SQLite, which does not support READ COMMITTED.
In that same PR, there's a test with the following:
if make_url(SQLALCHEMY_DATABASE_URI).get_backend_name() in ("postgresql", "mysql"):
    SQLALCHEMY_ENGINE_OPTIONS["isolation_level"] = "READ COMMITTED"  # noqa: F405
@sadpandajoe added a comment about this but never got a reply.
In summary, I think the fix is to use READ COMMITTED for both MySQL and Postgres and SERIALIZABLE for SQLite.
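A minimal sketch of what that could look like in config.py, assuming SQLALCHEMY_DATABASE_URI is already defined there (an illustration, not the actual patch):

```python
from sqlalchemy.engine.url import make_url

_backend = make_url(SQLALCHEMY_DATABASE_URI).get_backend_name()
SQLALCHEMY_ENGINE_OPTIONS = {
    # READ COMMITTED for MySQL/Postgres; SQLite does not support it, so fall back to SERIALIZABLE.
    "isolation_level": "READ COMMITTED" if _backend in ("postgresql", "mysql") else "SERIALIZABLE",
}
```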
@michael-s-molina your SQLite comment makes sense, though now I'm confused. @mapledan reports that changing his Postgres to READ COMMITTED fixes the issue, even though we're pointing to a PR that seems to set the default to READ COMMITTED (?) Why was @mapledan's isolation_level not set by this default (?) Did you have another setting?
The PR that I linked is NOT setting Postgres to READ COMMITTED, it's setting it to SERIALIZABLE. I think the fix is to use READ COMMITTED for both MySQL and Postgres and SERIALIZABLE for SQLite.
I think it would be a better idea to provide sane defaults in the extensions of BaseEngineSpec that get applied to SQLALCHEMY_ENGINE_OPTIONS and can be overridden via superset_config.py, rather than having a global default in config.py that doesn't globally work.
@villebro That's what I meant with:
> I think the fix is to use READ COMMITTED for both MySQL and Postgres and SERIALIZABLE for SQLite.
A global configuration won't work because you may have multiple database connections with different transaction isolation levels.
Oh right! I had assumed so given the title of the PR, but it clearly sets a default that doesn't match the title. Good catch.
> I think it would be a better idea to provide sane defaults in the extensions of BaseEngineSpec
Mmmh, I get the idea, but I'd say the db_engine_spec abstraction is there to serve analytics-type workloads. I'd say for an analytics workload on Postgres you'd want AUTOCOMMIT, as typically you want as little isolation overhead as possible. Very different from OLTP. Say if you happen to use Postgres for both the metadata db and analytics workloads, you'd probably want READ COMMITTED for the metadata db and AUTOCOMMIT for analytics...
I'd open a PR with a dynamic default for config.py, only specifying READ COMMITTED for mysql and postgres, but realizing that at config.py parsing time we may not know what the database connection looks like. Maybe using default logic in DB_CONNECTION_MUTATOR (?). Or should we just not specify SQLALCHEMY_ENGINE_OPTIONS in config.py at all?
Submitting this DRAFT PR for review -> https://github.com/apache/superset/pull/30174
Yeah you're absolutely right, my bad 👍 However, I'm still not entirely buying into having this in config.py, but let me sleep on it and see if I'm wiser in the morning.
> Mmmh, I get the idea, but I'd say the db_engine_spec abstraction is there to serve analytics-type workloads. I'd say for analytics workload on postgres you'd want AUTOCOMMIT, as typically you want as little as possible isolation overhead. Very different from OLTP. Say if you happen to use Postgres for both metadata db and analytics workload, you'd probably want READ COMMITTED for the metadata db and AUTOCOMMIT for analytics...
Good point 👍🏼
@mistercrunch @villebro Do you think this configuration could belong to the database connection modal? Maybe in the Performance section?
It could be a select with SQLAlchemy options, with the default value coming from the db engine spec. We could have a method called default_transaction_isolation_level that returns a value, or None for databases that don't support isolation levels. The option to change the value would only appear if a value other than None is received.
| Database   | Isolation Level |
|------------|-----------------|
| MySQL      | REPEATABLE READ |
| PostgreSQL | READ COMMITTED  |
| Oracle     | READ COMMITTED  |
| SQL Server | READ COMMITTED  |
| SQLite     | SERIALIZABLE    |
| Trino      | None            |
| Druid      | None            |
| Hive       | READ COMMITTED  |
| Presto     | READ COMMITTED  |
This would allow us to change the isolation level for both metadata and analytical databases.
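To make the proposal concrete, here's a rough sketch of what such a hook could look like. The method name follows the proposal above, and these stripped-down classes are only stand-ins for the real specs in superset/db_engine_specs, not an actual implementation:

```python
from typing import Optional


class BaseEngineSpec:  # stand-in for superset.db_engine_specs.base.BaseEngineSpec
    @classmethod
    def default_transaction_isolation_level(cls) -> Optional[str]:
        # None means the engine does not support (or expose) isolation levels,
        # so the connection modal would hide the option entirely.
        return None


class PostgresEngineSpec(BaseEngineSpec):
    @classmethod
    def default_transaction_isolation_level(cls) -> Optional[str]:
        return "READ COMMITTED"


class MySQLEngineSpec(BaseEngineSpec):
    @classmethod
    def default_transaction_isolation_level(cls) -> Optional[str]:
        return "REPEATABLE READ"


class TrinoEngineSpec(BaseEngineSpec):
    pass  # inherits None: no isolation level selector shown in the modal
```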
> do you think this configuration could belong to the database connection modal?
I think this setting matters most for the metadata database as it relates to OLTP-type workloads, and that db connection is not exposed in the UI. For the metadata db, we should just pick a solid default (READ COMMITTED) and stick to it for CI/tests/production (and to @john-bodley's point, keep it consistent across engines as much as possible). Also we should probably officially only support mysql and postgres, everything else is "use at your own risk". Wanna run it on Oracle or MSSQL? Good luck, as we don't have licenses and can't blow up our build matrix.
It matters much less for the analytics-type workload, especially for single-query-per-session usage that is just SELECTs, with something like Airflow or dbt typically doing a single INSERT/UPDATE statement per "session". There's rarely session/rollback management in your typical analytics database workload. Maybe at times you'd have some stored-procedure-type workload with ROLLBACK logic on the ETL side.
I agree with @mistercrunch , IIUC the issue here is limited to the metastore connection, not analytical db connections (for which the modal is relevant). So I'm leaning towards having sane defaults that can be overridden if needed. I propose moving the discussion to Max's PR (I'll take a look now).
link to PR for convenience -> https://github.com/apache/superset/pull/30174
Bug description
In version 4.0.2, my sql_lab async tasks were functioning correctly with Celery. However, after upgrading to version 4.1.0rc2, I started encountering errors when executing sql_lab.get_sql_results. The tasks are now failing with a psycopg2.errors.SerializationFailure.
My database is PostgreSQL 15. Here is my current superset_config:
How to reproduce the bug
Screenshots/recordings
Superset version
master / latest-dev
Python version
3.10
Node version
16
Browser
Chrome
Additional context
No response
Checklist