airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

DB migrations are timing out after 1 second #840

Closed hpereira98 closed 3 months ago

hpereira98 commented 3 months ago

Chart Version

8.8.0

Kubernetes Version

Client Version: v1.29.0
Server Version: v1.28.6-eks-508b6b3

Helm Version

version.BuildInfo{Version:"v3.13.3", GitCommit:"c8b948945e52abba22ff885446a1486cb5fd3474", GitTreeState:"clean", GoVersion:"go1.21.5"}

Description

We're trying to upgrade Airflow from version 1.10.12 to 2.7.3. Locally, on a Minikube cluster and a local PostgreSQL database, the upgrade works as expected.

However, when trying to deploy it in a remote K8s cluster, connected to an AWS RDS database (PostgreSQL 16.2), the deployment does not work as the database migrations are timing out after 1 second.

After taking a look at the code, we could see that check_migrations is called with a timeout of 1 second. We find it odd that no one has raised this issue before, since the User-Community Airflow Chart does not let us configure this timeout value, unlike the official chart, where we can set images.migrationsWaitTimeout.

We've also tried setting properties: "?sslmode=require" in the externalDatabase config, but the same issue occurs.

The issue doesn't seem to be related to the database connection, as the check-db step is running correctly, and check_migrations is correctly fetching the latest applied migration (da3f683c3a5a).

Can anyone help us understand this issue?

Relevant Logs

/home/airflow/.local/lib/python3.8/site-packages/airflow/config_templates/airflow_local_settings.py:193 DeprecationWarning: The remote_logging option in [core] has been moved to the remote_logging option in [logging] - the old setting has been used, but please update your config.
/home/airflow/.local/lib/python3.8/site-packages/airflow/config_templates/airflow_local_settings.py:206 DeprecationWarning: The remote_base_log_folder option in [core] has been moved to the remote_base_log_folder option in [logging] - the old setting has been used, but please update your config.
[2024-03-21T17:56:47.266+0000] {db.py:798} INFO - Waiting for migrations... 0 second(s)
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 822, in _configured_alembic_environment
    yield env
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 799, in check_migrations
    raise TimeoutError(
TimeoutError: There are still unapplied migrations after 1 seconds. MigrationHead(s) in DB: {'da3f683c3a5a'} | Migration Head(s) in Source Code: {'405de8318b3a'}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1062, in _rollback_impl
    self.engine.dialect.do_rollback(self.connection)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
    dbapi_connection.rollback()
psycopg2.OperationalError: SSL connection has been closed unexpectedly
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/mnt/scripts/db_migrations.py", line 78, in <module>
    main(sync_forever=True)
  File "/mnt/scripts/db_migrations.py", line 52, in main
    if needs_db_migrations():
  File "/mnt/scripts/db_migrations.py", line 34, in needs_db_migrations
    check_migrations(1)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 799, in check_migrations
    raise TimeoutError(
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 822, in _configured_alembic_environment
    yield env
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 219, in __exit__
    self.close()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/future/engine.py", line 246, in close
    super(Connection, self).close()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1238, in close
    self._transaction.close()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2426, in close
    self._do_close()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2649, in _do_close
    self._close_impl()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2635, in _close_impl
    self._connection_rollback_impl()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2627, in _connection_rollback_impl
    self.connection._rollback_impl()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1064, in _rollback_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
    util.raise_(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1062, in _rollback_impl
    self.engine.dialect.do_rollback(self.connection)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 683, in do_rollback
    dbapi_connection.rollback()
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL connection has been closed unexpectedly
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Custom Helm Values

airflow:
  dbMigrations:
    enabled: true
  externalDatabase:
    type: "postgres"
    host: "<our host>"
    port: "<our port>"
    database: "<our database>"
    user: "<our user>"
    passwordSecret: "<our password secret>"
    passwordSecretKey: "<our password secret key>"

thesuperzapper commented 3 months ago

@hpereira98 it's not timing out after 1 second; the relevant error is psycopg2.OperationalError: SSL connection has been closed unexpectedly.

This indicates that your RDS instance is closing the connection for some reason.

After looking online, it's probably related to a lack of resources on the RDS instance, or to some other configuration error, such as the one this person on Reddit found (related to an invalid init_query).
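
The traceback above shows needs_db_migrations calling check_migrations(1). A minimal sketch of why the TimeoutError would not be the failure, assuming the chart's script treats a TimeoutError from that 1-second probe as "migrations are pending" (the actual script is not shown in this thread). fake_check_migrations and ConnectionDropped are hypothetical stand-ins for airflow.utils.db.check_migrations and sqlalchemy.exc.OperationalError, so the example runs without a database:

```python
class ConnectionDropped(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.OperationalError."""


def fake_check_migrations(timeout, db_heads, source_heads, connection_ok=True):
    """Hypothetical stand-in for airflow.utils.db.check_migrations.

    Raises TimeoutError when the heads differ. If the connection was dropped,
    the cleanup step raises a second error that replaces the first one.
    """
    try:
        if db_heads != source_heads:
            raise TimeoutError(
                f"There are still unapplied migrations after {timeout} seconds."
            )
    finally:
        if not connection_ok:
            # Simulates the failed rollback when the connection is closed.
            raise ConnectionDropped("SSL connection has been closed unexpectedly")


def needs_db_migrations(db_heads, source_heads, connection_ok=True):
    """Assumed probe logic: a TimeoutError only means migrations are pending."""
    try:
        fake_check_migrations(1, db_heads, source_heads, connection_ok)
        return False
    except TimeoutError:
        return True
```

With a healthy connection the TimeoutError is absorbed and the function simply returns True. With a dropped connection, the error raised during cleanup is the one that escapes, which matches the "During handling of the above exception, another exception occurred" chain in the logs.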

hpereira98 commented 3 months ago

Yeah, this was actually an issue with our RDS database, where we had a parameter group setting idle_in_transaction_session_timeout to a value under 1s. Thanks for your help!
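
For anyone hitting the same error, a rough sketch of how the value reported by SHOW idle_in_transaction_session_timeout could be sanity-checked. It assumes the value uses the standard PostgreSQL time units (us, ms, s, min, h, d, with milliseconds when no unit is given, and 0 meaning disabled) and that the migration check sits idle inside a transaction for about one second between polls. The function names and the 1000 ms threshold are illustrative, not part of the chart or of Airflow:

```python
import re

# Multipliers from PostgreSQL time units to milliseconds.
_UNIT_TO_MS = {
    "us": 0.001,
    "ms": 1,
    "s": 1000,
    "min": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
}


def pg_timeout_to_ms(value):
    """Convert a setting such as '500ms', '1s' or '0' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(us|ms|s|min|h|d)?\s*", value)
    if match is None:
        raise ValueError(f"unrecognised timeout value: {value!r}")
    number, unit = match.groups()
    # A bare number is interpreted as milliseconds for this parameter.
    return int(number) * _UNIT_TO_MS[unit or "ms"]


def may_kill_migration_check(value, idle_gap_ms=1000):
    """True if the timeout is enabled and shorter than the assumed idle gap."""
    timeout_ms = pg_timeout_to_ms(value)
    return 0 < timeout_ms < idle_gap_ms
```

For example, may_kill_migration_check("500ms") flags the setting, while "0" (disabled) and "5min" do not.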