Open andycai-sift opened 1 month ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
Regarding the first proposed solution (HPA waiting for completion of work): I'm afraid this is not fixable by the Airflow project, as it's a limitation of Kubernetes that during a scale-down request you can't choose a dynamic "wait for completion of work" amount of time. The timing for graceful shutdown has to be configured at container start time; you can't update it while the celery worker container is already running (the resource spec is immutable, so you have to trigger a "deploy"-style operation that bounces the container to make any change).
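For context, the grace period is baked into the worker pod spec when the pod is created. In the official Airflow Helm chart it is exposed as a worker-level value, roughly like this (a sketch; check the chart's values reference for your chart version):

```yaml
# values.yaml (Apache Airflow official Helm chart) -- sketch
workers:
  # Copied into the worker pod spec at creation time and immutable
  # afterwards; changing it requires rolling out new worker pods.
  terminationGracePeriodSeconds: 3600  # 1 hour, as in the report below
```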
I have been waiting for this for years, but the K8s development community isn't working on it currently. https://github.com/kubernetes/enhancements/issues/2255#issuecomment-2261281830
Not sure about the second proposed solution though. I'm not certain whether Kubernetes sends an upfront signal to warn that the shutdown signal will arrive in grace-period seconds.
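For what it's worth, Kubernetes does not send a separate early-warning signal: at termination it runs any `preStop` hook, then delivers SIGTERM, and delivers SIGKILL once the grace period expires. The closest built-in "advance notice" mechanism is a `preStop` hook, sketched below (the marker-file path is a made-up convention, not anything the chart provides):

```yaml
# Container spec sketch: preStop runs to completion before SIGTERM is
# delivered, so the worker process can watch for the marker file as a
# "shutdown is coming" notification. The /tmp path is hypothetical.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "touch /tmp/shutdown-pending && sleep 30"]
```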
Official Helm Chart version
1.15.0 (latest released)
Apache Airflow version
2.7.1
Kubernetes Version
1.27.2
Helm Chart configuration
Since the official Helm chart now supports HPA, here are my HPA settings.
Docker Image customizations
Not too many special packages; just added openjdk, the mongo client, and some other Python-related dependencies.
What happened
When I enabled HPA in the Airflow cluster, some long-running tasks (which can run for over 15 hours) failed during scale-down and lost their logs. More details: the `terminationGracePeriodSeconds` limit was being hit, which caused the tasks to be marked failed even though they were actually still waiting for the final response status from the Dataproc jobs. I have currently set `terminationGracePeriodSeconds` to 1 hour, but given the long-running tasks I am not sure how the HPA settings can support them.
What you think should happen instead
During HPA scale-down, either the celery worker pod should shut down gracefully only after all tasks are completed, or Airflow should provide a config (or some other mechanism) to kill the task process before the final termination deadline, complete the cleanup, and upload the logs to the remote destination (e.g., a GCS bucket).
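The second option could be sketched in the task/worker code as a SIGTERM handler that abandons the external poll loop early enough to flush logs before SIGKILL arrives. This is a minimal illustration, not Airflow's actual implementation; `upload_logs_to_gcs` and `run_task` are hypothetical names:

```python
import signal
import time

# On pod termination the kubelet sends SIGTERM, then SIGKILL after
# terminationGracePeriodSeconds. The handler below just flips a flag;
# the poll loop checks it and exits in time to upload logs.

SHUTDOWN_REQUESTED = False

def _handle_sigterm(signum, frame):
    global SHUTDOWN_REQUESTED
    SHUTDOWN_REQUESTED = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def upload_logs_to_gcs(lines):
    # Placeholder: a real deployment would push the task log to a GCS
    # bucket via the configured remote-logging backend.
    return len(lines)

def run_task(poll_fn, max_polls=100, poll_interval=0.0):
    """Poll an external job (e.g. Dataproc) until done or shutdown."""
    log = []
    for _ in range(max_polls):
        if SHUTDOWN_REQUESTED:
            log.append("shutdown requested; abandoning poll loop")
            break
        status = poll_fn()
        log.append(f"poll status: {status}")
        if status == "DONE":
            break
        time.sleep(poll_interval)
    uploaded = upload_logs_to_gcs(log)
    return log, uploaded
```

Usage: `run_task(lambda: check_dataproc_job(job_id))`, where the poll callable returns the job's status string; on SIGTERM the loop exits at the next poll boundary and the partial log is still uploaded.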
How to reproduce
Set `terminationGracePeriodSeconds` to a small number.
Anything else
No response
Are you willing to submit PR?
Code of Conduct