apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Delay in marking a task's state (success/upstream_failed and failed) #41884

Open amyshields opened 2 weeks ago

amyshields commented 2 weeks ago

Apache Airflow version

2.9.3

If "Other Airflow 2 version" selected, which one?

No response

What happened?

We have seen this issue several times.

  1. A task fails
  2. Up to 5 minutes go by (this is the longest wait we have seen)
  3. The task itself is marked as FAILED
  4. All downstream tasks are marked as upstream_failed

It is important to note that we also see this behaviour when a task succeeds: the success is not reflected in the Airflow UI or in its metadata DB.

We have validated this by also calling Airflow's REST API to retrieve the task instance; the state had not been updated as we would expect.
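For reference, a minimal sketch of the kind of API check described above, using the stable REST API's task-instance endpoint. The base URL and the IDs are placeholders for your deployment, and authentication headers are left out:

```python
import json
import urllib.request

# Placeholder base URL -- adjust for your deployment.
BASE_URL = "http://localhost:8080/api/v1"


def task_instance_url(dag_id: str, dag_run_id: str, task_id: str) -> str:
    """Build the stable REST API URL for a single task instance."""
    return f"{BASE_URL}/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}"


def fetch_state(dag_id: str, dag_run_id: str, task_id: str) -> str:
    """Return the task instance state as reported by the API."""
    req = urllib.request.Request(task_instance_url(dag_id, dag_run_id, task_id))
    # Add auth headers here as required by your API auth backend.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["state"]
```

Polling this endpoint right after a task finishes, and comparing the reported state against the worker log, is one way to put a number on the delay.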

This exact case happened today (30th Aug) with a 2 minute delay:

  1. task_A failed today at 7:22 BST (screenshot attached)
  2. One of its downstream tasks was still in a None state at 7:23 BST (screenshot attached)
  3. The downstream task was only set to upstream_failed at 7:25 BST (screenshot attached)

What you think should happen instead?

  1. A task fails
  2. The task is immediately marked as FAILED
  3. All downstream tasks are immediately marked as upstream_failed

We do not expect any delay in the task being marked with its appropriate state, nor in the marking of any downstream tasks.

How to reproduce

This is hard to reproduce: unfortunately, the metadata DB (task_instance table) only ever stores the latest state of a task. To minimize production downtime we immediately retry failed tasks, which then succeed, so the first (failed) state is never stored. Possibly one could compare row-update timestamps against task completion timestamps and measure the delay there.
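The timestamp comparison suggested above could be sketched as a query against the metadata DB. This assumes a Postgres backend and relies on the `updated_at` column that (to my knowledge) was added to `task_instance` in Airflow 2.6, so it should exist on 2.9.3; treat it as a starting point, not a definitive diagnostic:

```python
# Rows where the last update to the task_instance row lags well behind
# end_date are candidates for the delayed state marking described above.
DELAY_QUERY = """
SELECT dag_id,
       task_id,
       run_id,
       state,
       end_date,
       updated_at,
       updated_at - end_date AS marking_delay
FROM task_instance
WHERE end_date IS NOT NULL
ORDER BY updated_at - end_date DESC
LIMIT 20;
"""
```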

Operating System

linux/arm64

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

We use this docker image: apache/airflow:2.9.3-python3.9

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 2 weeks ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.

raphaelauv commented 2 weeks ago

A retry_delay under 30 seconds is risky, since Airflow with a distributed/remote/edge executor like the CeleryExecutor is eventually consistent.

amyshields commented 2 weeks ago

retry_delay under 30 seconds is risky since airflow with a distributed/remote/edge executor like CeleryExecutor is eventually consistent

Sorry, I am not sure I understand what you mean here (@raphaelauv). What is retry_delay? Is this something we control? How is it used?

jscheffl commented 2 weeks ago

I could imagine the delay is happening because of some infrastructure flakiness. Five minutes "smells" a bit like the typical heartbeat timeout for the liveness signal a running task sends to the DB.

Can you check the logs/stdout of the worker where the task is executing? Some errors might be printed to stdout that are not picked up by the logging facility.

The case you describe "should not happen", but I doubt it is a systematic problem; it is more likely an infrastructure problem.
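For context on the heartbeat-timeout hypothesis, these are the configuration knobs that seem relevant; the defaults shown are my understanding of Airflow 2.x, and notably the zombie-task threshold defaults to 300 seconds, matching the observed 5-minute ceiling:

```ini
# airflow.cfg (defaults shown -- verify against your deployment)
[scheduler]
# How long a task may go without heartbeating before the scheduler
# considers it a zombie and fails it. Default: 300 seconds.
scheduler_zombie_task_threshold = 300

# How often job heartbeats are written to the DB. Default: 5 seconds.
job_heartbeat_sec = 5
```

If the worker's heartbeats are being lost (e.g. DB connectivity flakiness), a delay of up to the zombie threshold before the state flips would be consistent with what was observed.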

raphaelauv commented 2 weeks ago

@amyshields you said "we are immediately retrying failed tasks"

the default retry_delay is 300 seconds: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#default-task-retry-delay
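To make the knob concrete: retry_delay is a per-task argument controlling how long Airflow waits before retrying a failed task, typically set via default_args. A minimal sketch in plain Python (the DAG/operator imports are omitted so it runs without an Airflow install; the values are illustrative):

```python
from datetime import timedelta

# default_args as you would pass to airflow.DAG(default_args=...) or to
# an individual operator. If retry_delay is unset, Airflow falls back to
# the default_task_retry_delay config option, which defaults to 300s.
default_args = {
    "retries": 3,
    # Per the comment above, values well under ~30 seconds are risky
    # with distributed executors such as CeleryExecutor.
    "retry_delay": timedelta(seconds=60),
}
```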