apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
35.21k stars 13.76k forks source link

kubernetes_executor does not properly fail task-instances in cleanup_stuck_queued_tasks #39078

Open waldoppper opened 2 months ago

waldoppper commented 2 months ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.3

What happened?

I'm chasing an issue involving a hard-to-reproduce issue with deferrable operators being "stuck" in Queued state.

In debugging, I noticed what seems to be a defficiency in the kubernetes_executor's override of cleanup_stuck_queued_tasks is not actually failing the instance.

What you think should happen instead?

Background: In debugging, I focused on the logs available at default log-level, which included

Marking task instance <TaskInstance: my_dag.my_group.my_task scheduled__2024-03-25T20:00:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.

In reviewing the code printing this message, it seems clear that the expectation of a base_executor subclass is to fail the task-instances. For reference, the celery_executor does.

Problem: The kubernetes_executor does not.

Solution: It seems to me like it should.

How to reproduce

The true root cause of my issue is still a mystery to me. This is an attempt at fixing this safety net.

Operating System

debian

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==8.0.0

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

paramjeet01 commented 2 months ago

@waldoppper , Did you mitigate this issue ? Found any temporary solution ?

waldoppper commented 2 months ago

We've tried manually deleting the task instance with no luck. (this makes me think that addressing this particular issue may not help us)

The only workaround I'm aware of is to disable deferral on operators.

waldoppper commented 1 month ago

@paramjeet01, we found that restarting the scheduler enables the task to start.

dirrao commented 1 month ago

@paramjeet01 Its a bug in the Kubernetes executor. I am going to work on the fix for the same.

eladkal commented 6 days ago

@dirrao are you working on this issue?

dirrao commented 5 days ago

As per the Kubernetes executor, if the tasks that are stuck in a queued state and don't have associated tasks are rescheduled. If the tasks are stuck in a queued state and have associated pods, then those will be marked as failed. I believe this behviour seems to be correct.

@hussein-awala / @jedcunningham WDYT?