Open waldoppper opened 7 months ago
@waldoppper , Did you mitigate this issue ? Found any temporary solution ?
We've tried manually deleting the task instance with no luck. (this makes me think that addressing this particular issue may not help us)
The only workaround I'm aware of is to disable deferral on operators.
@paramjeet01, we found that restarting the scheduler enables the task to start.
@paramjeet01 Its a bug in the Kubernetes executor. I am going to work on the fix for the same.
@dirrao are you working on this issue?
As per the Kubernetes executor, if the tasks that are stuck in a queued state and don't have associated tasks are rescheduled. If the tasks are stuck in a queued state and have associated pods, then those will be marked as failed. I believe this behviour seems to be correct.
@hussein-awala / @jedcunningham WDYT?
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.3
What happened?
I'm chasing an issue involving a hard-to-reproduce issue with deferrable operators being "stuck" in Queued state.
In debugging, I noticed what seems to be a defficiency in the kubernetes_executor's override of cleanup_stuck_queued_tasks is not actually
fail
ing the instance.What you think should happen instead?
Background: In debugging, I focused on the logs available at default log-level, which included
In reviewing the code printing this message, it seems clear that the expectation of a base_executor subclass is to
fail
the task-instances. For reference, the celery_executor does.Problem: The kubernetes_executor does not.
Solution: It seems to me like it should.
How to reproduce
The true root cause of my issue is still a mystery to me. This is an attempt at fixing this safety net.
Operating System
debian
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==8.0.0
Deployment
Other 3rd-party Helm chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct