Open michaelimas1 opened 2 months ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
Im on 2.8.3 and can confirm similar behaviour. Suddenly I see a large chunk of scheduled map tasks be marked for failure. No logs or anything.
Another thing i can add which might be relevant - the mapped tasks are using a custom operator
It will be difficult to debug without reproduce steps (including example dag)
@eladkal thanks, I'll keep trying to reproduce and update here if i get to it.
By looking at the result here, do you have maybe a rough estimation on what can be the area in the system this can be related to? It can help focusing the reproduction efforts
Apache Airflow version
2.9.3
If "Other Airflow 2 version" selected, which one?
No response
What happened?
Observed behavior
some task instances are marked as failed without any log, as if they never started. When clearing the DAG run it succeeds so it seems like a temporary issue. It doesn't trigger the on_failure_callback which makes it very tricky to identify and fix in a timely manner.
Potential causes
We see a time correlation with scheduler pods termination, as part of deployment change (i.e. someone changed a custom operator or a module). Both pods were terminated around the time the DAG run marked some of the tasks as failed- i see that the last successful TIs are right after the pods termination. I do see however that 2 new pods were initialized ~5 mins before the "old" pods were terminated. Which should be enough time for them to adopt the newly orphaned tasks. Maybe I don't understand the expected behavior correctly.
The pods are constantly restarted and i discover this in delay (since there's no indication) so I don't have to the logs of the new schedulers but i can try to collect them if you think this is indeed a possible cause for this issue.
Helm deployment strategy (2 pods):
We run on Airflow v2.9.3 with Kubenetes executor
What you think should happen instead?
The tasks instances should have been executed
How to reproduce
I wasn't able to reproduce it by simply restarting the schedulers when a DAG with Mapped Tasks is running..
Operating System
Debian GNU/Linux 12
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct