The serialized_dag table is cleared when every new instance is started

aws / amazon-mwaa-docker-images

Apache License 2.0


The serialized_dag table is cleared when every new instance is started #124

Closed millin closed 3 months ago

millin commented 4 months ago

Describe the bug

The serialized_dag table is cleared every time a new instance (e.g. a worker) is started.

This results in the following errors:

A recently started task is killed as a zombie
DagNotFound exception when trigger_dag is called
"DAG <dag_id> seems to be missing from DagBag" error in the webserver
dag-processor error (https://github.com/apache/airflow/issues/40082)

The error occurs very often because every time a new image starts (e.g. for a new worker), the entrypoint calls the airflow db migrate command, which in turn runs reserialize_dags and clears the whole serialized_dag table.

Originally posted by @millin in https://github.com/apache/airflow/issues/40082#issuecomment-2253016337

rafidka commented 4 months ago

Thank you, @millin. I will try to take a look today and get back to you with what I find. I am surprised the serialized_dag table is cleared with every call to db migrate. I thought it was basically a no-op when the DB is already up to date.

millin commented 4 months ago

I am also surprised, but the default of the reserialize_dags argument is True: https://github.com/apache/airflow/blob/f0ef69198ec0b7ad0c489cbccf76f6130445fedf/airflow/cli/cli_config.py#L659-L666

I can suggest a quick fix

- await run_command("airflow db migrate", env=environ)
+ await run_command("airflow db migrate --no-reserialize-dags", env=environ)

But I'm not sure it won't cause problems when updating the Airflow version.

rafidka commented 4 months ago

Yeah, I also noticed the reserialize_dags argument. In addition to not being super comfortable about using an argument which is not documented, I also think the re-serialization is required when we are doing a version upgrade, which is possible with MWAA. Ideally, Airflow shouldn't be doing the re-serialization if the DB is already migrated, and we should probably report this as a bug in Airflow. However, for MWAA, we will probably have to change the code to do the check ourselves and avoid calling db migrate if the DB is already initialized.
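
The "do the check ourselves" idea could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the MWAA image. It assumes the schema revision is stored by Alembic in the alembic_version table (column version_num), and it uses an in-memory SQLite database as a stand-in for the real metadata DB. The names current_revision, migrate_if_needed, and the expected_head value are hypothetical and exist only for this example.

```python
import sqlite3
from typing import Callable, Optional


def current_revision(conn: sqlite3.Connection) -> Optional[str]:
    """Return the revision stored in alembic_version, or None if the DB was never migrated."""
    try:
        row = conn.execute("SELECT version_num FROM alembic_version").fetchone()
    except sqlite3.OperationalError:
        # The table does not exist yet, so the database is uninitialized.
        return None
    return row[0] if row else None


def migrate_if_needed(
    conn: sqlite3.Connection,
    expected_head: str,
    migrate: Callable[[], None],
) -> bool:
    """Call migrate() only when the stored revision differs from expected_head.

    Returns True if a migration was triggered, False if it was skipped.
    """
    if current_revision(conn) == expected_head:
        # Schema is already up to date: skip the migration, and with it
        # the re-serialization that would clear serialized_dag.
        return False
    migrate()
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")

    def fake_migrate() -> None:
        # Hypothetical stand-in for running "airflow db migrate".
        conn.execute("CREATE TABLE IF NOT EXISTS alembic_version (version_num TEXT)")
        conn.execute("DELETE FROM alembic_version")
        conn.execute("INSERT INTO alembic_version VALUES ('head_rev')")

    print(migrate_if_needed(conn, "head_rev", fake_migrate))  # first start: migrates
    print(migrate_if_needed(conn, "head_rev", fake_migrate))  # later starts: skipped
```

With a check like this, a version upgrade (where the expected head revision changes) would still run the full migration, including re-serialization, while ordinary worker starts would leave serialized_dag untouched.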

rafidka commented 3 months ago

@millin, this should fix the issue you reported: https://github.com/aws/amazon-mwaa-docker-images/pull/125. I am currently out of office, but MWAA developers should pick up the PR and merge it internally. Feel free to ping them if you don't hear back, or alternatively reach out to AWS support to ensure the team stays on top of this issue, as it is pretty important and might impact multiple customers without them necessarily noticing.

Mercury2699 commented 3 months ago

The fix has been deployed to all regions. Customers can trigger an environment update to receive the latest image.