Open amyshields opened 2 weeks ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so — no need to wait for approval.
A retry_delay under 30 seconds is risky, since Airflow with a distributed/remote/edge executor like CeleryExecutor is eventually consistent.
Sorry, I am not sure I understand what you mean here (@raphaelauv) — what is retry_delay? Is this something we control? How is it used?
I could imagine a delay is happening because you have some infrastructure flakiness. Five minutes "smells" a bit like the typical heartbeat timeout that a task sends to the DB to signal it is still alive.
Can you check the logs/stdout of the worker where the task is executing? Some errors might be printed to stdout that are not picked up by the logging facility.
The case you describe "should not happen", but I doubt it is a systematic problem — rather an infrastructure problem.
@amyshields
You said "we are immediately retrying failed tasks".
The default retry_delay (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#default-task-retry-delay) is 300 seconds.
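To make the point about retry_delay concrete, here is a minimal sketch of per-task retry settings. The names in `default_args` are standard Airflow task arguments; the 60-second value is a hypothetical override of the 300-second default, kept above the ~30-second threshold raphaelauv warns about for eventually consistent executors like CeleryExecutor.

```python
from datetime import timedelta

# Airflow's default_task_retry_delay configuration value is 300 seconds (5 minutes).
DEFAULT_RETRY_DELAY = timedelta(seconds=300)

# Hypothetical per-task override, e.g. passed as default_args to a DAG.
# Values under ~30 seconds are risky with CeleryExecutor, because state
# propagation between scheduler, broker and workers is eventually consistent.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(seconds=60),
}
```

With these arguments, a failed task is retried after 60 seconds rather than the default 5 minutes, which matches the "immediately retrying" behaviour described in the issue while staying clear of the risky sub-30-second range.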
Apache Airflow version
2.9.3
If "Other Airflow 2 version" selected, which one?
No response
What happened?
We have seen this issue several times: a task fails, but there is a delay before it is marked FAILED and before its downstream tasks are marked upstream_failed.
It is important to note that we also see this behaviour when a task succeeds: the success is not reflected in the Airflow UI or its metadata DB.
We have validated this by also making a call to Airflow's API to retrieve the task instance, and the state had not been updated as we would expect.
This exact case happened today (30th Aug) with a 2 minute delay:
What you think should happen instead?
We do not expect any delay in the task being marked FAILED, nor in its downstream tasks being marked upstream_failed.
How to reproduce
This is hard to reproduce because, unfortunately, the metadata DB (task_instance table) only ever stores the latest state of a task. To minimize production downtime we immediately retry failed tasks; the retry then succeeds, so the first (failed) state is never stored. One could possibly look at insertion timestamps versus task completion timestamps and measure the delay there.
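The timestamp comparison suggested above can be sketched as follows. The two datetimes are hypothetical values standing in for columns on the task_instance table (the task's recorded completion time versus the time the state row was last written); the column names and the 2-minute gap are assumptions for illustration, matching the delay reported in the issue.

```python
from datetime import datetime, timezone

# Hypothetical timestamps read from the task_instance table:
end_date = datetime(2024, 8, 30, 10, 0, 0, tzinfo=timezone.utc)    # task finished
updated_at = datetime(2024, 8, 30, 10, 2, 0, tzinfo=timezone.utc)  # state row last written

# A large gap here would indicate the state took long to propagate to the DB.
propagation_delay = updated_at - end_date
print(propagation_delay.total_seconds())  # 120.0 — the 2-minute delay observed
```

If the delay reproduces, collecting a few such pairs across affected task instances would help distinguish a systematic scheduler/executor problem from one-off infrastructure flakiness.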
Operating System
linux/arm64
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
We use this docker image: apache/airflow:2.9.3-python3.9
Anything else?
No response
Are you willing to submit PR?
Code of Conduct