apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Mapped tasks marked as failed on scheduler restart #42107

Open michaelimas1 opened 2 months ago

michaelimas1 commented 2 months ago

Apache Airflow version

2.9.3

If "Other Airflow 2 version" selected, which one?

No response

What happened?

Observed behavior

Some task instances are marked as failed without any logs, as if they never started. Clearing the DAG run makes them succeed, so it looks like a transient issue. The failure doesn't trigger the on_failure_callback, which makes it very hard to identify and fix in a timely manner.
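One way such failures might be spotted without relying on the callback is to scan the task instances of a DAG run for entries that are failed but have no start time. The following is a minimal sketch, not a confirmed diagnosis: it assumes the task instances have already been fetched as dictionaries (for example from the stable REST API's task-instance listing, which exposes `task_id`, `map_index`, `state` and `start_date`), and it assumes that the affected task instances really do have an empty `start_date`. The function name `find_silent_failures` and the sample records are hypothetical.

```python
from typing import Any


def find_silent_failures(task_instances: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return task instances that are failed but apparently never started.

    Assumption: a task instance that never ran has start_date set to None.
    """
    return [
        ti
        for ti in task_instances
        if ti.get("state") == "failed" and ti.get("start_date") is None
    ]


# Hypothetical sample data shaped like REST API task-instance records.
sample = [
    {"task_id": "process", "map_index": 0, "state": "success",
     "start_date": "2024-09-01T10:00:00+00:00"},
    {"task_id": "process", "map_index": 1, "state": "failed",
     "start_date": None},
    {"task_id": "process", "map_index": 2, "state": "failed",
     "start_date": "2024-09-01T10:00:05+00:00"},
]

for ti in find_silent_failures(sample):
    print(ti["task_id"], ti["map_index"])
```

Only the second record matches here, since the third one failed after actually starting and would normally have logs.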


Potential causes

We see a time correlation with scheduler pod termination as part of a deployment change (i.e. someone changed a custom operator or a module). Both pods were terminated around the time the DAG run marked some of the tasks as failed: the last successful TIs are from right after the pod termination. However, I do see that 2 new pods were initialized ~5 minutes before the "old" pods were terminated, which should be enough time for them to adopt the newly orphaned tasks. Maybe I don't understand the expected behavior correctly.

The pods are restarted constantly and I only discover the failures after a delay (since there's no indication), so I don't have access to the logs of the new schedulers, but I can try to collect them if you think this is indeed a possible cause of the issue.

Helm deployment strategy (2 pods):

    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate

We run Airflow v2.9.3 with the Kubernetes executor.
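For reference, the percentages in the rolling-update strategy above translate into absolute pod counts roughly as follows. This is a small sketch of the arithmetic, assuming the standard Kubernetes Deployment rules (maxSurge is rounded up, maxUnavailable is rounded down); the helper name `rolling_update_bounds` is hypothetical.

```python
import math


def rolling_update_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> tuple[int, int]:
    """Convert percentage-based rollingUpdate settings into absolute pod counts.

    Assumes Kubernetes Deployment semantics: maxSurge rounds up,
    maxUnavailable rounds down.
    """
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return surge, unavailable


surge, unavailable = rolling_update_bounds(replicas=2, max_surge_pct=25, max_unavailable_pct=25)
print(f"extra pods allowed during rollout: {surge}")
print(f"pods allowed to be unavailable: {unavailable}")
```

With 2 scheduler replicas this gives 1 surge pod and 0 unavailable pods, so under these assumptions a new scheduler should be started and become ready before an old one is terminated, which is consistent with the new pods appearing ahead of the old pods' termination.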

What you think should happen instead?

The task instances should have been executed.

How to reproduce

I wasn't able to reproduce it by simply restarting the schedulers while a DAG with mapped tasks is running.

Operating System

Debian GNU/Linux 12

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 2 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

kand617 commented 2 months ago

I'm on 2.8.3 and can confirm similar behaviour. Suddenly I see a large chunk of scheduled mapped tasks marked as failed. No logs or anything.

michaelimas1 commented 2 months ago

Another thing I can add that might be relevant: the mapped tasks use a custom operator.

eladkal commented 2 months ago

It will be difficult to debug without reproduction steps (including an example DAG).

michaelimas1 commented 2 months ago

@eladkal thanks, I'll keep trying to reproduce it and will update here if I manage to.

Looking at the result here, do you have a rough idea of which area of the system this might be related to? It would help focus the reproduction efforts.