apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

worker: Warm shutdown (MainProcess) #41685

Closed fzhan closed 5 days ago

fzhan commented 3 weeks ago

Apache Airflow version

2.10.0

If "Other Airflow 2 version" selected, which one?

No response

What happened?

Workers kept going into warm shutdown.

Liveness probe failed: Error: No nodes replied within time constraint

worker: Warm shutdown (MainProcess)
[2024-08-23 02:46:40 +0000] [95] [INFO] Handling signal: term
[2024-08-23 02:46:40 +0000] [97] [INFO] Worker exiting (pid: 97)
[2024-08-23 02:46:40 +0000] [96] [INFO] Worker exiting (pid: 96)
[2024-08-23 02:46:40 +0000] [95] [INFO] Shutting down: Master

What you think should happen instead?

The workers used to stay alive.

How to reproduce

helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace

Operating System

Kubernetes 1.29

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace
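
If the restarts are driven by the ping-based liveness probe timing out, one thing worth trying is relaxing the worker probe through chart values. A minimal sketch, assuming the official chart exposes workers.livenessProbe.* settings; the exact keys and values here are illustrative, not verified against this chart version:

# Relax the Celery worker liveness probe so slow "inspect ping" replies
# are tolerated instead of triggering a warm shutdown and restart.
# Key names assume the chart's workers.livenessProbe.* values.
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow \
  --set workers.livenessProbe.initialDelaySeconds=30 \
  --set workers.livenessProbe.timeoutSeconds=60 \
  --set workers.livenessProbe.periodSeconds=60 \
  --set workers.livenessProbe.failureThreshold=10

Raising timeoutSeconds and failureThreshold mainly buys the worker time to answer the ping when the broker or the pod is briefly overloaded.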

Anything else?

CONNECTION_CHECK_MAX_COUNT=0 exec /entrypoint python -m celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d celery@$(hostname)
Error: No nodes replied within time constraint
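
To check whether this is a probe-timing problem or a genuine broker/worker connectivity problem, the same ping can be run by hand inside a worker pod. A sketch, assuming the chart's default naming; the pod name airflow-worker-0 and container name worker are assumptions:

# Run the liveness command manually inside the worker container
# (pod and container names are assumptions about the default chart naming).
kubectl exec -n airflow airflow-worker-0 -c worker -- bash -c \
  'python -m celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d celery@$(hostname) --timeout 10'

If the manual ping also fails, the probe is only reporting a real problem between the worker and the broker rather than causing it.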

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 3 weeks ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.

jscheffl commented 2 weeks ago

Warm shutdown is triggered when a SIGTERM (or SIGINT) is sent to the Celery process. Can you please check your K8s events: did a liveness probe kick in, or was some re-balancing in your Kubernetes cluster requesting a node to shut down? Warm shutdown is not usually triggered by the application itself, except when a deployment change is made.
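
For reference, checks like the following would show that; the component=worker label is an assumption about the chart's default pod labels:

# Look for liveness-probe kills, evictions, or node drains around the shutdown time.
kubectl get events -n airflow --sort-by=.lastTimestamp | grep -iE 'liveness|kill|evict|drain'
# Inspect the worker pods' last termination state and restart counts.
kubectl describe pod -n airflow -l component=worker | grep -A 5 'Last State'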

jscheffl commented 2 weeks ago

Also, it would be good to know - this is not written in the text - was this happening once? After how many tasks / how much time? Continuously? On how many nodes? Any side effects such as failed tasks?

fzhan commented 5 days ago

@jscheffl thanks for the update and sorry for the late reply, I have just been experimenting with different settings.

So after a fresh deployment, with no restrictions on resources, the scheduler seems to be running quite stably. The last log lines I can see before a restart are:

[2024-09-10T15:26:17.797+0000] {scheduler_job_runner.py:260} INFO - Exiting gracefully upon receiving signal 15
[2024-09-10T15:26:18.799+0000] {process_utils.py:132} INFO - Sending 15 to group 13035. PIDs of all processes in the group: [28335, 28336, 13035]
[2024-09-10T15:26:18.799+0000] {process_utils.py:87} INFO - Sending the signal 15 to group 13035

Prior to that, over a couple of days, it accumulated more than 400 restarts.
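
One way to narrow down what keeps sending signal 15 is to look at the recorded termination reason and restart count of the scheduler container (for example OOMKilled versus a probe-triggered kill). A sketch, assuming the chart's default component=scheduler label:

# Print pod name, restart count, and the reason the previous container instance was terminated.
kubectl get pods -n airflow -l component=scheduler \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'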