apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Zombie tasks detected after service restart sometimes do not retry #27657

Closed: saulbein closed this issue 11 months ago

saulbein commented 1 year ago

Apache Airflow version

Airflow 2.5.0

What happened

We're having an issue because we sometimes need to restart Airflow services. Tasks that were running during the restart will not retry properly (with Airflow 2.3.4 some would retry successfully; now that is no longer the case). I have not managed to figure out the cause, but I can provide logs from the investigation (see below).

Discussion #27071 seems to describe a similar issue.

What you think should happen instead

All tasks should have been marked for retry and retried.

How to reproduce

I have not managed to reproduce this issue locally; on every attempt the tasks retry as expected.
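
For reference, a minimal DAG of the kind one might use for such reproduction attempts (a hypothetical sketch, not one of the DAGs from the logs below; the dag_id and task_id are made up): a single long-running task with one retry, interrupted by restarting the scheduler and workers mid-run.

# Hypothetical reproduction DAG: a long sleep with one retry, so that a task killed
# by a service restart should be detected as a zombie and then retried.
import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def long_running_task():
    time.sleep(1800)  # long enough to still be running while services restart


with DAG(
    dag_id="zombie_retry_repro",
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,
    max_active_runs=1,
) as dag:
    PythonOperator(
        task_id="long_sleep",
        python_callable=long_running_task,
        retries=1,
        retry_delay=timedelta(minutes=1),
    )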

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

Airflow version: release:2.5.0+fa2bec042995004f45b914dd1d66b466ccced410
Provider versions (there are more providers used by DAGs, but they should not be relevant):
apache-airflow-providers-amazon==7.0.0
apache-airflow-providers-celery==3.1.0
apache-airflow-providers-postgres==5.4.0

Deployment

Other Docker-based deployment

Deployment details

Setup with issues (runs on a k8s cluster):

Local test setup (same docker images used as the setup with issues):

Anything else

Notes:

Logs of when the issue happened (dag_1.task_a and dag_2.task_b both get interrupted by the deployment):

2023-01-23 13:27:33.022 | Dag Sync | DAG sync to volume starts
2023-01-23 13:27:37.613 | Dag Sync | dag_1 is copied to volume, but dag_2 is not changed so left as is
2023-01-23 13:27:42.661 | Dag Sync | DAG sync to volume finishes
2023-01-23 13:27:55.480 | Old Scheduler 1 | Scheduler 1 refreshes DAGs, start complaining about missing ones
2023-01-23 13:27:55.525 | Old Scheduler 1 | {dagbag.py:209} WARNING - Serialized DAG user_notification no longer exists
2023-01-23 13:27:55.525 | Old Scheduler 1 | {scheduler_job.py:970} ERROR - DAG 'user_notification' not found in serialized_dag table
2023-01-23 13:27:55.700 | Old Scheduler 2 | Scheduler 2 refreshes DAGs, start complaining about missing ones
2023-01-23 13:27:57.515 | Old Scheduler 1 | {scheduler_job.py:1276} ERROR - Couldn't find dag user_notification in DagBag/DB!
2023-01-23 13:27:57.541 | Old Scheduler 1 | {dagbag.py:209} WARNING - Serialized DAG derived_courier_onboarding no longer exists
2023-01-23 13:27:57.541 | Old Scheduler 1 | {scheduler_job.py:1276} ERROR - Couldn't find dag derived_courier_onboarding in DagBag/DB!
2023-01-23 13:27:57.670 | Old Scheduler 1 | {scheduler_job.py:970} ERROR - DAG 'user_notification' not found in serialized_dag table
2023-01-23 13:27:57.702 | Old Scheduler 1 | {scheduler_job.py:970} ERROR - DAG 'derived_courier_onboarding' not found in serialized_dag table
2023-01-23 13:27:58.888 | Old Scheduler 1 | {scheduler_job.py:1276} ERROR - Couldn't find dag user_notification in DagBag/DB!
2023-01-23 13:27:58.920 | Old Scheduler 1 | {scheduler_job.py:1276} ERROR - Couldn't find dag derived_courier_onboarding in DagBag/DB!
2023-01-23 13:27:59.052 | Old Scheduler 1 | {scheduler_job.py:970} ERROR - DAG 'user_notification' not found in serialized_dag table
2023-01-23 13:27:59.083 | Old Scheduler 1 | {scheduler_job.py:970} ERROR - DAG 'derived_courier_onboarding' not found in serialized_dag table
2023-01-23 13:28:10.776 | Old Scheduler 2 | {dagbag.py:209} WARNING - Serialized DAG user_notification no longer exists
2023-01-23 13:28:10.776 | Old Scheduler 2 | {scheduler_job.py:1276} ERROR - Couldn't find dag user_notification in DagBag/DB!
2023-01-23 13:28:10.800 | Old Scheduler 2 | {dagbag.py:209} WARNING - Serialized DAG derived_courier_onboarding no longer exists
2023-01-23 13:28:10.800 | Old Scheduler 2 | {scheduler_job.py:1276} ERROR - Couldn't find dag derived_courier_onboarding in DagBag/DB!
2023-01-23 13:28:10.913 | Old Scheduler 2 | {scheduler_job.py:970} ERROR - DAG 'user_notification' not found in serialized_dag table
2023-01-23 13:28:10.940 | Old Scheduler 2 | {scheduler_job.py:970} ERROR - DAG 'derived_courier_onboarding' not found in serialized_dag table
2023-01-23 13:28:12.640 | Old Scheduler 2 | Scheduler 2 shutdown
2023-01-23 13:28:12.663 | Old Scheduler 1 | Scheduler 1 shutdown
2023-01-23 13:28:12.673 | Old Webserver | Webserver shutdown
2023-01-23 13:28:12.678 | Worker | First worker shutdown
2023-01-23 13:28:13.433 | Worker | Last worker shutdown
2023-01-23 13:29:14.935 | Webserver | Webserver start
2023-01-23 13:29:52.984 | Scheduler 1 | Scheduler 1 start
2023-01-23 13:29:53.745 | Scheduler 1 | {celery_executor.py:533} INFO - Adopted the following 127 tasks from a dead executor
2023-01-23 13:29:53.745 | Scheduler 1 | <TaskInstance: dag_2.task_b scheduled__2023-01-23T11:00:00+00:00 [running]> in state SUCCESS
2023-01-23 13:29:53.745 | Scheduler 1 | <TaskInstance: dag_1.task_a scheduled__2023-01-23T11:00:00+00:00 [running]> in state STARTED
2023-01-23 13:29:54.249 | Scheduler 1 | {scheduler_job.py:602} INFO - Executor reports execution of dag_2.task_b run_id=scheduled__2023-01-23T11:00:00+00:00 exited with status success for try_number 1
2023-01-23 13:29:54.265 | Scheduler 1 | {scheduler_job.py:645} INFO - TaskInstance Finished: dag_id=dag_2, task_id=task_b, run_id=scheduled__2023-01-23T11:00:00+00:00, map_index=-1, run_start_date=2023-01-23 12:22:09.627743+00:00, run_end_date=None, run_duration=None, state=running, executor_state=success, try_number=1, max_tries=1, job_id=18337338, pool=default_pool, queue=default, priority_weight=1, operator=Operator, queued_dttm=2023-01-23 12:07:58.735041+00:00, queued_by_job_id=18337602, pid=53377
2023-01-23 13:30:09.331 | Scheduler 2 | Scheduler 2 start
2023-01-23 13:30:18.578 | Worker | First worker start
2023-01-23 13:30:34.754 | Worker | Last worker start
2023-01-23 13:33:38.914 | Scheduler 1 | {scheduler_job.py:1533} WARNING - Failing (78) jobs without heartbeat after 2023-01-23 13:28:38.901205+00:00
2023-01-23 13:33:39.046 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_2.py', 'processor_subdir': '/usr/local/airflow/dags', 'msg': "{'DAG Id': 'dag_2', 'Task Id': 'task_b', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'cd9ce53b-233f-441d-af8a-9a8c7d66ff2c'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb529f370>, 'is_failure_callback': True}
2023-01-23 13:33:39.165 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_1.py', 'processor_subdir': None, 'msg': "{'DAG Id': 'dag_1', 'Task Id': 'task_a', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'd032a2e4-f981-46cd-9396-43b91cf2cf6e'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb50bebb0>, 'is_failure_callback': True}
2023-01-23 13:33:49.804 | Scheduler 1 | {scheduler_job.py:1533} WARNING - Failing (45) jobs without heartbeat after 2023-01-23 13:28:49.794533+00:00
2023-01-23 13:33:49.879 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_2.py', 'processor_subdir': '/usr/local/airflow/dags', 'msg': "{'DAG Id': 'dag_2', 'Task Id': 'task_b', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'cd9ce53b-233f-441d-af8a-9a8c7d66ff2c'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb50301f0>, 'is_failure_callback': True}
2023-01-23 13:33:49.954 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_1.py', 'processor_subdir': None, 'msg': "{'DAG Id': 'dag_1', 'Task Id': 'task_a', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'd032a2e4-f981-46cd-9396-43b91cf2cf6e'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb51715e0>, 'is_failure_callback': True}
2023-01-23 13:34:00.440 | Scheduler 1 | {scheduler_job.py:1533} WARNING - Failing (37) jobs without heartbeat after 2023-01-23 13:29:00.428299+00:00
2023-01-23 13:34:00.539 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_2.py', 'processor_subdir': '/usr/local/airflow/dags', 'msg': "{'DAG Id': 'dag_2', 'Task Id': 'task_b', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'cd9ce53b-233f-441d-af8a-9a8c7d66ff2c'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb4ae0d00>, 'is_failure_callback': True}
2023-01-23 13:34:00.644 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_1.py', 'processor_subdir': None, 'msg': "{'DAG Id': 'dag_1', 'Task Id': 'task_a', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'd032a2e4-f981-46cd-9396-43b91cf2cf6e'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb4b3f9d0>, 'is_failure_callback': True}
2023-01-23 13:34:11.128 | Scheduler 1 | {scheduler_job.py:1533} WARNING - Failing (20) jobs without heartbeat after 2023-01-23 13:29:11.121027+00:00
2023-01-23 13:34:11.203 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_2.py', 'processor_subdir': '/usr/local/airflow/dags', 'msg': "{'DAG Id': 'dag_2', 'Task Id': 'task_b', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'cd9ce53b-233f-441d-af8a-9a8c7d66ff2c'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb52adf40>, 'is_failure_callback': True}
2023-01-23 13:34:11.244 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_1.py', 'processor_subdir': None, 'msg': "{'DAG Id': 'dag_1', 'Task Id': 'task_a', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'd032a2e4-f981-46cd-9396-43b91cf2cf6e'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb4e90790>, 'is_failure_callback': True}
2023-01-23 13:34:21.533 | Scheduler 1 | {scheduler_job.py:1533} WARNING - Failing (15) jobs without heartbeat after 2023-01-23 13:29:21.525371+00:00
2023-01-23 13:34:21.635 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_2.py', 'processor_subdir': '/usr/local/airflow/dags', 'msg': "{'DAG Id': 'dag_2', 'Task Id': 'task_b', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'cd9ce53b-233f-441d-af8a-9a8c7d66ff2c'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb4d91df0>, 'is_failure_callback': True}
2023-01-23 13:34:21.660 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_1.py', 'processor_subdir': None, 'msg': "{'DAG Id': 'dag_1', 'Task Id': 'task_a', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'd032a2e4-f981-46cd-9396-43b91cf2cf6e'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb4b2ffd0>, 'is_failure_callback': True}
2023-01-23 13:34:28.165 | Scheduler 2 | {dagrun.py:585} ERROR - Marking run <DagRun dag_1 @ 2023-01-23 11:00:00+00:00: scheduled__2023-01-23T11:00:00+00:00, state:running, queued_at: 2023-01-23 12:24:42.247390+00:00. externally triggered: False> failed
2023-01-23 13:34:28.176 | Scheduler 2 | {dagrun.py:657} INFO - DagRun Finished: dag_id=dag_1, execution_date=2023-01-23 11:00:00+00:00, run_id=scheduled__2023-01-23T11:00:00+00:00, run_start_date=2023-01-23 12:24:42.270358+00:00, run_end_date=2023-01-23 13:34:28.175002+00:00, run_duration=4185.904644, state=failed, external_trigger=False, run_type=scheduled, data_interval_start=2023-01-23 11:00:00+00:00, data_interval_end=2023-01-23 12:00:00+00:00, dag_hash=220c226f9f417a8816d708306928bef5
2023-01-23 13:34:28.184 | Scheduler 2 | {dag.py:3423} INFO - Setting next_dagrun for dag_1 to 2023-01-23T12:00:00+00:00, run_after=2023-01-23T13:00:00+00:00
2023-01-23 13:34:28.827 | Scheduler 1 | {scheduler_job.py:1197} INFO - DAG dag_1 is at (or above) max_active_runs (2 of 1), not creating any more runs
2023-01-23 13:34:28.849 | Scheduler 1 | {dagrun.py:585} ERROR - Marking run <DagRun dag_1 @ 2023-01-23 11:00:00+00:00: scheduled__2023-01-23T11:00:00+00:00, state:running, queued_at: 2023-01-23 12:24:42.247390+00:00. externally triggered: False> failed
2023-01-23 13:34:28.855 | Scheduler 1 | {dagrun.py:657} INFO - DagRun Finished: dag_id=dag_1, execution_date=2023-01-23 11:00:00+00:00, run_id=scheduled__2023-01-23T11:00:00+00:00, run_start_date=2023-01-23 12:24:42.270358+00:00, run_end_date=2023-01-23 13:34:28.854957+00:00, run_duration=4186.584599, state=failed, external_trigger=False, run_type=scheduled, data_interval_start=2023-01-23 11:00:00+00:00, data_interval_end=2023-01-23 12:00:00+00:00, dag_hash=220c226f9f417a8816d708306928bef5
2023-01-23 13:34:28.859 | Scheduler 1 | {scheduler_job.py:1197} INFO - DAG dag_1 is at (or above) max_active_runs (1 of 1), not creating any more runs
2023-01-23 13:34:29.517 | Scheduler 2 | {dagrun.py:585} ERROR - Marking run <DagRun dag_1 @ 2023-01-23 11:00:00+00:00: scheduled__2023-01-23T11:00:00+00:00, state:running, queued_at: 2023-01-23 12:24:42.247390+00:00. externally triggered: False> failed
2023-01-23 13:34:29.523 | Scheduler 2 | {dagrun.py:657} INFO - DagRun Finished: dag_id=dag_1, execution_date=2023-01-23 11:00:00+00:00, run_id=scheduled__2023-01-23T11:00:00+00:00, run_start_date=2023-01-23 12:24:42.270358+00:00, run_end_date=2023-01-23 13:34:29.523140+00:00, run_duration=4187.252782, state=failed, external_trigger=False, run_type=scheduled, data_interval_start=2023-01-23 11:00:00+00:00, data_interval_end=2023-01-23 12:00:00+00:00, dag_hash=220c226f9f417a8816d708306928bef5
2023-01-23 13:34:29.529 | Scheduler 2 | {scheduler_job.py:1197} INFO - DAG dag_1 is at (or above) max_active_runs (1 of 1), not creating any more runs
2023-01-23 13:34:31.902 | Scheduler 1 | {scheduler_job.py:1533} WARNING - Failing (12) jobs without heartbeat after 2023-01-23 13:29:31.894820+00:00
2023-01-23 13:34:31.972 | Scheduler 1 | {scheduler_job.py:1543} ERROR - Detected zombie job: {'full_filepath': '/usr/local/airflow/dags/dag_2.py', 'processor_subdir': '/usr/local/airflow/dags', 'msg': "{'DAG Id': 'dag_2', 'Task Id': 'task_b', 'Run Id': 'scheduled__2023-01-23T11:00:00+00:00', 'Hostname': '<hidden>', 'External Executor Id': 'cd9ce53b-233f-441d-af8a-9a8c7d66ff2c'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f5bb53153d0>, 'is_failure_callback': True}
2023-01-23 13:34:36.469 | Scheduler 2 | {dagrun.py:585} ERROR - Marking run <DagRun dag_2 @ 2023-01-23 11:00:00+00:00: scheduled__2023-01-23T11:00:00+00:00, state:running, queued_at: 2023-01-23 12:00:00.853024+00:00. externally triggered: False> failed
2023-01-23 13:34:36.475 | Scheduler 2 | {dagrun.py:657} INFO - DagRun Finished: dag_id=dag_2, execution_date=2023-01-23 11:00:00+00:00, run_id=scheduled__2023-01-23T11:00:00+00:00, run_start_date=2023-01-23 12:00:01.017868+00:00, run_end_date=2023-01-23 13:34:36.475599+00:00, run_duration=5675.457731, state=failed, external_trigger=False, run_type=scheduled, data_interval_start=2023-01-23 11:00:00+00:00, data_interval_end=2023-01-23 12:00:00+00:00, dag_hash=4929c5c37dcf9a6391aca2787ee5ff73
2023-01-23 13:34:36.483 | Scheduler 2 | {dag.py:3423} INFO - Setting next_dagrun for dag_2 to 2023-01-23T12:00:00+00:00, run_after=2023-01-23T13:00:00+00:00
2023-01-23 13:34:36.757 | Scheduler 1 | {dag.py:3423} INFO - Setting next_dagrun for dag_2 to 2023-01-23T13:00:00+00:00, run_after=2023-01-23T14:00:00+00:00
2023-01-23 13:34:38.098 | Scheduler 1 | {dagrun.py:585} ERROR - Marking run <DagRun dag_2 @ 2023-01-23 11:00:00+00:00: scheduled__2023-01-23T11:00:00+00:00, state:running, queued_at: 2023-01-23 12:00:00.853024+00:00. externally triggered: False> failed
2023-01-23 13:34:38.106 | Scheduler 1 | {dagrun.py:657} INFO - DagRun Finished: dag_id=dag_2, execution_date=2023-01-23 11:00:00+00:00, run_id=scheduled__2023-01-23T11:00:00+00:00, run_start_date=2023-01-23 12:00:01.017868+00:00, run_end_date=2023-01-23 13:34:38.106316+00:00, run_duration=5677.088448, state=failed, external_trigger=False, run_type=scheduled, data_interval_start=2023-01-23 11:00:00+00:00, data_interval_end=2023-01-23 12:00:00+00:00, dag_hash=4929c5c37dcf9a6391aca2787ee5ff73
2023-01-23 13:34:38.111 | Scheduler 1 | {dag.py:3423} INFO - Setting next_dagrun for dag_2 to 2023-01-23T12:00:00+00:00, run_after=2023-01-23T13:00:00+00:00
2023-01-23 13:34:38.859 | Scheduler 2 | {dagrun.py:585} ERROR - Marking run <DagRun dag_2 @ 2023-01-23 11:00:00+00:00: scheduled__2023-01-23T11:00:00+00:00, state:running, queued_at: 2023-01-23 12:00:00.853024+00:00. externally triggered: False> failed
2023-01-23 13:34:38.866 | Scheduler 2 | {dagrun.py:657} INFO - DagRun Finished: dag_id=dag_2, execution_date=2023-01-23 11:00:00+00:00, run_id=scheduled__2023-01-23T11:00:00+00:00, run_start_date=2023-01-23 12:00:01.017868+00:00, run_end_date=2023-01-23 13:34:38.866236+00:00, run_duration=5677.848368, state=failed, external_trigger=False, run_type=scheduled, data_interval_start=2023-01-23 11:00:00+00:00, data_interval_end=2023-01-23 12:00:00+00:00, dag_hash=4929c5c37dcf9a6391aca2787ee5ff73
2023-01-23 13:34:38.873 | Scheduler 2 | {dag.py:3423} INFO - Setting next_dagrun for dag_2 to 2023-01-23T12:00:00+00:00, run_after=2023-01-23T13:00:00+00:00
2023-01-23 13:39:33.279 | Worker | [2023-01-23 13:39:33,279: INFO/ForkPoolWorker-1] Running <TaskInstance: dag_1.task_a scheduled__2023-01-23T11:00:00+00:00 [failed]> on host 100.107.103.63
2023-01-23 13:39:34.815 | Scheduler 1 | {scheduler_job.py:602} INFO - Executor reports execution of dag_1.task_a run_id=scheduled__2023-01-23T11:00:00+00:00 exited with status success for try_number 1
2023-01-23 13:39:34.820 | Scheduler 1 | {scheduler_job.py:645} INFO - TaskInstance Finished: dag_id=dag_1, task_id=task_a, run_id=scheduled__2023-01-23T11:00:00+00:00, map_index=-1, run_start_date=2023-01-23 12:37:08.193714+00:00, run_end_date=2023-01-23 13:34:28.496570+00:00, run_duration=3440.302856, state=failed, executor_state=success, try_number=1, max_tries=1, job_id=18337425, pool=default_pool, queue=default, priority_weight=1, operator=Operator, queued_dttm=2023-01-23 12:25:04.455274+00:00, queued_by_job_id=18337602, pid=37044
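
For context: the five-minute gap between each "Failing (N) jobs without heartbeat after ..." timestamp and its cutoff appears to match the default [scheduler] scheduler_zombie_task_threshold of 300 seconds. A small sketch (assuming it is run inside the affected Airflow environment) to confirm the effective value:

# Print the effective zombie threshold; task instances whose job has not heartbeated
# for longer than this are the ones reported as "Detected zombie job" above.
from airflow.configuration import conf

threshold = conf.getint("scheduler", "scheduler_zombie_task_threshold")
print(f"scheduler_zombie_task_threshold = {threshold} seconds")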

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 1 year ago

Is it possible for you to upgrade to the latest released version of Airflow and test it @saulbein? We had a number of stability fixes in 2.4 (and 2.5.0 is about to be released), and while I understand that upgrading in the hope of fixing an issue might not seem justified, it often happens that similar or vaguely related issues get fixed along the way. We have reports that a number of stuck/queued-task problems have been fixed by some of those changes, and for an issue that cannot be easily reproduced, the best approach is to try the latest version. Doing so is rather straightforward, and if the issue happens relatively quickly, it can give very quick feedback without spending a lot of time on diagnosis.

And in any case, we only release fixes in the latest "minor" version, so upgrading is the only possible path to a fix anyway (the effort to upgrade is not lost, since the upgrade has to be performed regardless).

Can you please help with that, @saulbein?

saulbein commented 1 year ago

Alright, I'll see if we can upgrade this week or next. I don't see 2.5.0 in the releases, though?

potiuk commented 1 year ago

right: "it about to be released" it should be - we have just cancelled voting for RC1 and there is an RC2 coming today (will take 72+ hours of testing/voting at least to make it into official release).

But if you can upgrade to 2.4.3, you can do that now (and move to 2.5.0 once it is out and the "bleeding edge" people report that things are good). I always repeat this: upgrades should be done more frequently rather than less; it makes the upgrade process far less painful overall.

potiuk commented 1 year ago

2.5.0 is released now. Try it and see if you can reproduce the problem there @saulbein (and please post stack traces/logs etc. if you do).

saulbein commented 1 year ago

Due to my company's Christmas development slowdown, I don't think I'll be able to get the upgrade to 2.5.0 done until some time early next year. I'll still check 2.4.3; maybe the issue was already fixed in that version.

potiuk commented 1 year ago

We can wait, no worries.

saulbein commented 1 year ago

Updated to 2.5.0 (unfortunately right before 2.5.1). I updated the issue description with the current providers that we use and the log itself. Now we don't really get tasks retrying anymore: anything that is running during shutdown gets killed and doesn't attempt to retry.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

saulbein commented 1 year ago

Bump, this still needs investigation. If I should provide more info, let me know.

potiuk commented 1 year ago

@eladkal -> I added needs-triage - I guess this is precisely the case where we should add it back?

eladkal commented 1 year ago

@eladkal -> I added needs-triage - I guess this is precisely the case where we should add it back?

Oh yeah... I thought I added it :facepalm: Now that the report is on the latest Airflow version, we need to triage/validate whether we have enough information to check this issue or whether we should ask the author for further info. So yeah, we need to label it so this issue is in the queue waiting for triage.

hussein-awala commented 1 year ago

@saulbein I'm curious if you deleted the DAG files or performed any synchronization during the restart.

ERROR - DAG '...' not found in serialized_dag table

It seems like the DAGs are missing from the dagbag. In this situation, the scheduler won't be able to schedule or queue the failed tasks.
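
A minimal diagnostic sketch (assuming it is run inside the Airflow environment with access to the metadata database; the dag_id is one of the DAGs reported as missing in the logs above) to check whether a DAG still has a row in the serialized_dag table:

# If a DAG has no row in serialized_dag, the scheduler cannot schedule or queue
# its task instances until the file is parsed and serialized again.
from airflow.models.serialized_dag import SerializedDagModel
from airflow.utils.session import create_session

dag_id = "user_notification"  # one of the DAG ids from the warnings above

with create_session() as session:
    serialized = SerializedDagModel.get(dag_id, session=session)
    if serialized is None:
        print(f"{dag_id}: no row in serialized_dag")
    else:
        print(f"{dag_id}: last serialized at {serialized.last_updated}")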

saulbein commented 1 year ago

@saulbein I'm curious if you deleted the DAG files or performed any synchronization during the restart.

ERROR - DAG '...' not found in serialized_dag table

It seems like the DAGs are missing from the dagbag. In this situation, the scheduler won't be able to schedule or queue the failed tasks.

No deletes, just replacing the files that were modified. We even switched to git-sync and still get the "... not found in serialized_dag table" errors when restarting the services (sadly, we can't easily avoid restarting them for now).

Besides, shouldn't the scheduler be able to schedule them after they get parsed? Or does it fail once and then give up?

jscheffl commented 1 year ago

This error report has been stale for a while; I tried to follow the discussion to clean up the issue backlog. I assume that with the current release, 2.7.2, there have been many changes, and we would need to refresh the error report.

One thing I am curious about from the last message is that DAG files are being replaced (during the restart?). I assume we should consider, for the bug, that "only" a restart of the Airflow system (assumption: including DB and workers?) is made. Or do you just restart the scheduler? Is DAG parsing done within the scheduler process? If DAG parsing problems are also being reported, this might be caused by bad DAG parsing performance as well. If DAG parsing takes too long, DAGs that have not been parsed for a longer time are dropped from the database. That might be a side effect causing this message.
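
One way to check that hypothesis (a small sketch, assuming access to the metadata database from the Airflow environment) is to look at how recently each DAG was last parsed and whether it is still marked active:

# List DAGs by last parse time; DAGs whose files stop being parsed for too long are
# eventually deactivated by the scheduler, which would match the errors in the logs.
from airflow.models.dag import DagModel
from airflow.utils.session import create_session

with create_session() as session:
    for dag in session.query(DagModel).order_by(DagModel.last_parsed_time):
        status = "active" if dag.is_active else "inactive"
        print(dag.dag_id, dag.last_parsed_time, status)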

Could you maybe provide more details about the tasks you execute, or add a dummy DAG to the error report that shows the error, and try to reproduce it on Airflow 2.7.2?

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

github-actions[bot] commented 11 months ago

This issue has been closed because it has not received a response from the issue author.