apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Celery worker pods are terminated and task logs are lost when scale-down happens during HPA autoscaling #41163

Open andycai-sift opened 1 month ago

andycai-sift commented 1 month ago

Official Helm Chart version

1.15.0 (latest released)

Apache Airflow version

2.7.1

Kubernetes Version

1.27.2

Helm Chart configuration

Since the official Helm chart now supports HPA, here are my HPA settings.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: "airflow-worker"
  namespace: airflow
  labels:
    app: airflow
    component: worker
    release: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: "airflow-worker"
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
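
(For comparison, a minimal sketch of driving the same autoscaling through the chart's own values instead of a standalone manifest, assuming the chart exposes HPA support under workers.hpa; the exact key names here are an assumption, so check your chart version's values.yaml:)

workers:
  hpa:
    enabled: true
    minReplicaCount: 2
    maxReplicaCount: 16
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80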

Docker Image customizations

No special packages; just added OpenJDK, a MongoDB client, and some other Python-related dependencies.

What happened

After I enabled HPA in the Airflow cluster, some long-running tasks (some over 15 hours) failed during scale-down and their logs were lost.

What you think should happen instead

During HPA scale-down, one of two things should happen: either the Celery worker pod shuts down gracefully only after all its tasks are completed, or Airflow provides a config option (or similar mechanism) to kill the task processes before the final termination deadline, finish cleanup, and upload the logs to remote storage such as a GCS bucket.
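
(A common stopgap, not a full fix: give the worker pods a termination grace period longer than the longest expected task, since Celery's default warm shutdown on SIGTERM stops accepting new tasks and waits for running ones to finish. A minimal sketch, assuming the chart exposes workers.terminationGracePeriodSeconds as in recent values.yaml files:)

workers:
  # Must exceed the longest expected task duration (here ~16h); otherwise
  # the pod is SIGKILLed mid-task and the remote log upload never runs.
  terminationGracePeriodSeconds: 57600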

How to reproduce

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 month ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

dimon222 commented 1 month ago

Regarding the first proposed solution (HPA waiting for completion of work): I'm afraid this is not fixable by the Airflow project, as it is a limitation of Kubernetes that during a scale-down request you can't ask it to "wait for completion of work" for a dynamic amount of time. The graceful-shutdown timing has to be configured at container start time; you can't update it while the Celery worker container is already running (the resource spec is immutable, so you have to trigger a "deploy"-style operation that bounces the container to make any changes).
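
(To illustrate the point: the timeout lives in the pod spec itself and is fixed when the pod is created; a sketch of the relevant field:)

apiVersion: v1
kind: Pod
spec:
  # Set at pod creation time; cannot be changed on a running pod, so a
  # longer drain window requires recreating the worker pods.
  terminationGracePeriodSeconds: 600
  containers:
    - name: worker
      image: apache/airflow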

I have been waiting for this for years, but the K8s development community isn't currently working on it. https://github.com/kubernetes/enhancements/issues/2255#issuecomment-2261281830

Not sure about the second proposed solution, though. I'm not certain whether Kubernetes sends an upfront signal to warn that the shutdown signal will arrive once the grace period (in seconds) elapses.
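
(The closest built-in mechanism is a preStop hook, which Kubernetes runs before delivering SIGTERM; note it only delays the signal within the grace period, it cannot extend the deadline. A hypothetical sketch of draining the local Celery worker in preStop; the Celery app path and the JSON grep are assumptions to adapt for your Airflow version:)

containers:
  - name: worker
    lifecycle:
      preStop:
        exec:
          command:
            - "bash"
            - "-c"
            - |
              # Hypothetical drain loop: block until this worker reports no
              # active Celery tasks, so SIGTERM only lands on an idle worker.
              # Still bounded by terminationGracePeriodSeconds.
              while celery --app airflow.providers.celery.executors.celery_executor.app \
                  inspect active --destination "celery@${HOSTNAME}" --json \
                  | grep -q '"id"'; do
                sleep 30
              done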