Open wolfier opened 1 year ago
You are right. My vote is for a more generic message.
My preference would be to have two distinct error messages for the two different conditions. One seems far more common than the other.
That's better, but it doesn't look like it's gonna be easy to get the two distinct error messages due to the query there.
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There has been several Airflow releases since last activity on this issue. Kindly asking to recheck the report against latest Airflow version and let us know if the issue is reproducible. The issue will be closed in next 30 days if no further activity occurs from the issue author.
Apache Airflow version
2.5.1
What happened
When the scheduler finds zombies, a log is emitted to indicate how many jobs were found without heartbeats.
I ran into an odd case where a task instance became a zombie right after being executed.
The task is scheduled and queued by the scheduler and passed to the executor.
Celery worker picks up task instance, assigns celery task id (uuid), and emits executor event into event_buffer.
Scheduler reads event_buffer and acknowledges the task instances as assigned in Celery.
The task instance is marked as zombie soon after.
Based on the task logs, the task run command never got to the task execution part.
Because the command execution encountered an exception before running the execute method, the StandardTaskRunner exited, followed by the LocalTaskJob, which also exited with the state success without handling the state of the task instance. At this point the state of the task instance is running because the LocalTaskJob successfully created the StandardTaskRunner.
A task instance in the running state with its corresponding LocalTaskJob in the success state means the task instance is now a zombie but not because of the lack of heartbeats.
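To make the two conditions concrete, here is a minimal standalone sketch of how a running task instance could be classified as a zombie. It does not use Airflow's own models or query; `Row`, `ZombieCause`, and `zombie_cause` are hypothetical names, and the states are plain strings chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ZombieCause(Enum):
    # Hypothetical labels for the two conditions described in this issue.
    JOB_NOT_RUNNING = "job finished while the task instance is still running"
    STALE_HEARTBEAT = "job is running but its heartbeat is older than the threshold"


@dataclass
class Row:
    """Illustrative stand-in for one task instance joined with its job."""
    ti_state: str
    job_state: str
    latest_heartbeat: datetime


def zombie_cause(row: Row, now: datetime, threshold_secs: int) -> Optional[ZombieCause]:
    """Return why the task instance counts as a zombie, or None if it does not."""
    if row.ti_state != "running":
        return None  # only running task instances can be zombies
    if row.job_state != "running":
        # The case from this report: the job exited (success or failed)
        # without updating the task instance state.
        return ZombieCause.JOB_NOT_RUNNING
    if row.latest_heartbeat < now - timedelta(seconds=threshold_secs):
        # The common case: the job is still marked running but stopped heartbeating.
        return ZombieCause.STALE_HEARTBEAT
    return None
```

In the scenario above, the row would have `ti_state="running"` and `job_state="success"`, so it would be classified as `JOB_NOT_RUNNING` regardless of how recent the last heartbeat was.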
What you think should happen instead
As explained above, not all zombies are caused by missed heartbeats. When a LocalTaskJob succeeds or fails while the task instance is still in the running state, the task instance can also become a zombie.
While it is true that the LocalTaskJob corresponding to the task instance no longer has a heartbeat, I think it is incorrect to say the LocalTaskJob does not have heartbeats after scheduler_zombie_task_threshold, because that implies the LocalTaskJob was producing heartbeats before the current time minus scheduler_zombie_task_threshold seconds.
It would be more accurate to say something like this.
The current wording makes more sense for the common case where the LocalTaskJob is unable to update the heartbeat while still in the running state and the task instance is also in the running state.
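As a rough illustration of what reporting the two conditions separately could look like, the sketch below counts zombies per cause and builds one message for each. This is not the scheduler's actual code or log text: `summarize_zombies`, the input shape, and both message wordings are assumptions made up for this example.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def summarize_zombies(
    rows: Iterable[Tuple[str, datetime]],
    now: datetime,
    threshold_secs: int,
) -> List[str]:
    """Build one message per zombie cause.

    rows holds (job_state, latest_heartbeat) pairs for task instances
    that are still in the running state (hypothetical input shape).
    """
    cutoff = now - timedelta(seconds=threshold_secs)
    counts: Counter = Counter()
    for job_state, latest_heartbeat in rows:
        if job_state != "running":
            counts["job_finished"] += 1      # job exited, task instance left running
        elif latest_heartbeat < cutoff:
            counts["stale_heartbeat"] += 1   # job running, heartbeat too old

    messages = []
    if counts["stale_heartbeat"]:
        # Hypothetical wording for the common case.
        messages.append(
            f"{counts['stale_heartbeat']} running job(s) without a heartbeat "
            f"since {cutoff.isoformat()}"
        )
    if counts["job_finished"]:
        # Hypothetical wording for the case described in this issue.
        messages.append(
            f"{counts['job_finished']} task instance(s) still running "
            f"after their job already finished"
        )
    return messages
```

Splitting by cause in Python after the rows are fetched avoids changing the query itself, at the cost of selecting the job state and latest heartbeat alongside each task instance.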
I would like either:
How to reproduce
This is hard to do since you will need to fail the airflow run command before _run_task_by_selected_method runs.
Operating System
n/a
Versions of Apache Airflow Providers
No response
Deployment
Astronomer
Deployment details
n/a
Anything else
No response
Are you willing to submit PR?
Code of Conduct