Closed fzhan closed 5 days ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
Warm shutdown is triggered when a SIGTERM (or a first SIGINT) is sent to the Celery process. Can you please check your K8s events: did a liveness probe kick in, or did some rebalancing in your Kubernetes cluster request a node shutdown? Warm shutdown is not usually triggered by the application itself, except when a deployment change is made.
Also, it would be good to know what is not stated in the text: did this happen once? After how many tasks or how much time? Continuously? On how many nodes? Were there any side effects such as failed tasks?
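For the event check suggested above, here is a minimal sketch that filters the output of `kubectl get events -n airflow -o json` for probe failures and container kills. `Unhealthy` and `Killing` are standard kubelet event reasons; the function name, the pod names, and the sample data are hypothetical and only illustrate the shape of the output.

```python
import json

# "Unhealthy" (probe failed) and "Killing" (container being stopped) are
# standard kubelet event reasons. Everything else here is a hypothetical sketch.
SUSPECT_REASONS = {"Unhealthy", "Killing"}


def find_shutdown_events(events_json: str, name_prefix: str = "") -> list:
    """Return events whose reason suggests a probe failure or a container kill."""
    items = json.loads(events_json).get("items", [])
    hits = []
    for event in items:
        obj = event.get("involvedObject", {})
        if event.get("reason") not in SUSPECT_REASONS:
            continue
        if not obj.get("name", "").startswith(name_prefix):
            continue
        hits.append({
            "pod": obj.get("name"),
            "reason": event.get("reason"),
            "message": event.get("message"),
            "count": event.get("count", 1),
        })
    return hits


# Made-up sample shaped like `kubectl get events -o json`; pod names are assumed.
SAMPLE_EVENTS = json.dumps({
    "items": [
        {
            "reason": "Unhealthy",
            "message": "Liveness probe failed: Error: No nodes replied within time constraint",
            "count": 12,
            "involvedObject": {"kind": "Pod", "name": "airflow-worker-0"},
        },
        {
            "reason": "Killing",
            "message": "Container worker failed liveness probe, will be restarted",
            "count": 4,
            "involvedObject": {"kind": "Pod", "name": "airflow-worker-0"},
        },
        {
            "reason": "Pulled",
            "message": "Container image already present on machine",
            "involvedObject": {"kind": "Pod", "name": "airflow-scheduler-abc"},
        },
    ]
})

if __name__ == "__main__":
    for hit in find_shutdown_events(SAMPLE_EVENTS, "airflow-worker"):
        print(hit["pod"], hit["reason"], hit["count"], hit["message"])
```

In a real cluster the JSON would come from `kubectl get events -n airflow -o json` rather than from `SAMPLE_EVENTS`. A high `count` on `Unhealthy` events for the worker pods would point at the liveness probe rather than at node rebalancing.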
@jscheffl thanks for the update, and sorry for the late reply; I have just been experimenting with different settings.
So after a fresh deployment with no resource restrictions, the scheduler seems to be running quite stably. The latest log I can see before a restart is:
[2024-09-10T15:26:17.797+0000] {scheduler_job_runner.py:260} INFO - Exiting gracefully upon receiving signal 15
[2024-09-10T15:26:18.799+0000] {process_utils.py:132} INFO - Sending 15 to group 13035. PIDs of all processes in the group: [28335, 28336, 13035]
[2024-09-10T15:26:18.799+0000] {process_utils.py:87} INFO - Sending the signal 15 to group 13035
Prior to that, over a couple of days, it accumulated more than 400 restarts.
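Signal 15 in the scheduler log is SIGTERM, which is what Kubernetes sends to a container it is stopping. As a small sketch for confirming this across a larger log, the snippet below extracts the signal numbers from lines like the ones quoted above and maps them to names with the standard `signal` module. The function name and the regular expression are my own and only cover the message formats shown here.

```python
import re
import signal

# Matches "receiving signal 15", "Sending 15 to group ..." and
# "Sending the signal 15 to group ..." as seen in the log excerpt.
SIGNAL_RE = re.compile(r"(?:receiving signal|Sending(?: the signal)?) (\d+)")


def signal_names(log_text: str) -> list:
    """Return the symbolic name of every signal number mentioned in the log."""
    return [
        signal.Signals(int(match.group(1))).name
        for match in SIGNAL_RE.finditer(log_text)
    ]


LOG = """\
[2024-09-10T15:26:17.797+0000] {scheduler_job_runner.py:260} INFO - Exiting gracefully upon receiving signal 15
[2024-09-10T15:26:18.799+0000] {process_utils.py:132} INFO - Sending 15 to group 13035. PIDs of all processes in the group: [28335, 28336, 13035]
[2024-09-10T15:26:18.799+0000] {process_utils.py:87} INFO - Sending the signal 15 to group 13035
"""

if __name__ == "__main__":
    print(signal_names(LOG))
```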
Apache Airflow version
2.10.0
If "Other Airflow 2 version" selected, which one?
No response
What happened?
Workers kept going into warm shutdown.
Liveness probe failed: Error: No nodes replied within time constraint
worker: Warm shutdown (MainProcess)
[2024-08-23 02:46:40 +0000] [95] [INFO] Handling signal: term
[2024-08-23 02:46:40 +0000] [97] [INFO] Worker exiting (pid: 97)
[2024-08-23 02:46:40 +0000] [96] [INFO] Worker exiting (pid: 96)
[2024-08-23 02:46:40 +0000] [95] [INFO] Shutting down: Master
What you think should happen instead?
The workers should stay alive.
How to reproduce
helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace
Operating System
Kubernetes 1.29
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace
Anything else?
CONNECTION_CHECK_MAX_COUNT=0 exec /entrypoint python -m celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d celery@$(hostname)
Error: No nodes replied within time constraint
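To test whether the worker is simply slow to answer rather than dead, one option is to re-run the same ping by hand with a longer reply window. The sketch below assumes the `--timeout` option of `celery inspect` (reply timeout in seconds) and is meant to be run inside the worker container. The helper names are hypothetical, and this is not how the chart's probe itself is implemented.

```python
import socket
import subprocess

# App path taken from the probe command quoted above.
CELERY_APP = "airflow.providers.celery.executors.celery_executor.app"


def build_ping_command(hostname: str, timeout: float) -> list:
    """Build the `celery inspect ping` argv for one worker node."""
    return [
        "python", "-m", "celery",
        "--app", CELERY_APP,
        "inspect", "ping",
        "-d", f"celery@{hostname}",
        "--timeout", str(timeout),  # assumed option: seconds to wait for a reply
    ]


def worker_replies(timeout: float = 10.0) -> bool:
    """Return True if the local worker answers the ping within `timeout` seconds."""
    cmd = build_ping_command(socket.gethostname(), timeout)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    print("worker replied" if worker_replies() else "no reply within timeout")
```

If the worker replies with a longer timeout but not with the default one, the probe's time budget is the likely culprit rather than the worker itself.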
Are you willing to submit PR?
Code of Conduct