apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
37.43k stars 14.36k forks source link

Airflow database migration version scripts are unsafe, have failed, and are guaranteed to fail in the future #25209

Closed vitaly-krugl closed 2 years ago

vitaly-krugl commented 2 years ago

Apache Airflow version

2.3.2

What happened

Airflow uses unsafe practice in the implementation of multiple migration version scripts, including some recent ones that are guaranteed to fail in the future. Multiple migration version scripts import from Airflow itself, including imports of models.

The problem with this approach is that while the logic implemented by those imports at the time when the version script is created works fine, this may no longer be the case some number of releases later, because the implementation of those Airflow imports will change, so that when those functions/methods/classes are executed in the future (when migrating from an older version of airflow to a newer one), the new implementation will no longer match the state of the user's database which is being upgraded.

For example, if an older migration version script attempts to load a Model from database using the code of the target - newer - version of Airflow code, the fetching of the Model will if the new model class contains columns that didn't yet exist in the version of the system that is being upgraded by the user.

This unsafe practice is used in many version scripts, including some of the most recent ones:

$ git grep  import | grep airflow | grep model
db_types.py:    from airflow.models.base import StringID
env.py:from airflow import models, settings
versions/0023_1_8_2_add_max_tries_column_to_task_instance.py:from airflow.models import DagBag
versions/0054_1_10_10_add_dag_code_table.py:from airflow.models.dagcode import DagCode
versions/0061_2_0_0_increase_length_of_pool_name.py:from airflow.models.base import COLLATION_ARGS
versions/0064_2_0_0_add_unique_constraint_to_conn_id.py:from airflow.models.base import COLLATION_ARGS
versions/0073_2_0_0_prefix_dag_permissions.py:from airflow.www.fab_security.sqla.models import Action, Permission, Resource
versions/0092_2_2_0_add_data_interval_start_end_to_dagmodel_and_dagrun.py:from airflow.migrations.db_types import TIMESTAMP
versions/0099_2_3_0_add_task_log_filename_template_model.py:from airflow.migrations.utils import disable_sqlite_fkeys
versions/0099_2_3_0_add_task_log_filename_template_model.py:from airflow.utils.sqlalchemy import UtcDateTime
versions/0100_2_3_0_add_taskmap_and_map_id_on_taskinstance.py:from airflow.models.base import StringID

Please note that this issue is not purely theoretical! Here is an actual example from a production system: #25075. Please don't invalidate the current ticket due to the example being from an older version of Airflow because the same error-prone practice is found in more recent migration version scripts, so this type of failure is guaranteed to happen again.

What you think should happen instead

The implementations of migration version scripts MUST NOT make use Airflow functions/methods/classes - such as Models - whose implementations could change in the future (as it did in the reported issue #25075) - such as columns having been renamed/added/removed in more recent versions - resulting in guaranteed database migration failures in the future.

Also, to ensure robust database migration, Airflow maintainers - at a minimum - need to automate testing of migrations from various versions starting with a database where all tables are populated. Such test code needs to test migrations/upgrades starting from tables populated by different versions of airflow to ensure that migrations are safe and reliable.

How to reproduce

See issue #25075 as an example of reproducing. Please don't invalidate the current ticket due to the example being from an older version of Airflow because the same error-prone practice is found in more recent migration version scripts, so this type of failure is guaranteed to happen again if the existing unsafe migration version script code isn't fixed and the unsafe practice is not eradicated.

Operating System

OSX and Linux

Versions of Apache Airflow Providers

apache-airflow-providers-ftp==2.1.2 apache-airflow-providers-http==2.1.2 apache-airflow-providers-imap==2.2.3 apache-airflow-providers-sqlite==2.1.3

Deployment

Virtualenv installation

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

uranusjr commented 2 years ago

Of the examples you provided, only migrations 23, 54 and 73 a problematic (and you can see they are all fairly old). The rest are constants and helper functions that are intended to be used in migrations and guaranteed to not change. We should probably move them to somewhere else and add dev documentation to make this clear, but using them is not a problem on itself.

As for the actual problematic ones, feel free to fix them.

vitaly-krugl commented 2 years ago

One of the outcomes of this issue, besides fixing them and having adequate tests, is having guidelines in a migration/versions/README, or something like that, which discusses the unsafe practices to avoid. This would help maintainers and reviewers alike.

On Wed, Jul 20, 2022, 8:55 PM Tzu-ping Chung @.***> wrote:

Of the examples you provided, only migrations 23, 54 and 73 a problematic (and you can see they are all fairly old). The rest are constants and helper functions that are intended to be used in migrations and guaranteed to not change. We should probably move them to somewhere else and add dev documentation to make this clear, but using them is not a problem on itself.

As for the actual problematic ones, feel free to fix them.

— Reply to this email directly, view it on GitHub https://github.com/apache/airflow/issues/25209#issuecomment-1190916431, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAK72KSEPXX7QPJK23U22PLVVCNXJANCNFSM54FPL3FA . You are receiving this because you authored the thread.Message ID: @.***>

potiuk commented 2 years ago

@vitaly-krugl - if you would like to propose some guidelines - feel free to make PR. Like everything else in Airflow.,

Airflow is created by > 2100 contributors and if you would lilke to contribute such README and guidelines. That's cool. Seems your experience and ideas and cool and it's the best you ca do to help the community to contribute a PR with such quidelines.

Also I would lilke to draw your attention, to the fact that we tend not only to raise problems when we notice them, and document them in READMES but also automate most of such checks. So if you are able to come up with a proposal on how to automate such checks and fail PRs that introduce bad practices, adding a pre-commit that performs such checks is ideal. We have already ~ 100 pre-commits like this, so you are most welcome to devlop and submit pre-commit that performs such checks. That would be fantastic contribution.

BTW. Converting this into discussion - this is not a bug "per se" in airflow. It's the usual (we had probably hundreds of them) discussions on how we can improve our dev environment - and usually each such discussion ended up as a PR from those who raised it, rather than "bug" issue.