airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

airflow scheduler and worker memory leak #683

Closed: anu251989 closed this issue 1 year ago

anu251989 commented 1 year ago

Checks

Chart Version

8.6.0

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.14-eks-fb459a0", GitCommit:"b07006b2e59857b13fe5057a956e86225f0e82b7", GitTreeState:"clean", BuildDate:"2022-10-24T20:32:54Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

EKS 1.21

Helm Version

8.6.0

Description

The scheduler and worker pods' memory keeps increasing day by day, and the worker and scheduler pods have scaled up to the maximum number of pods.

(screenshots attached in the original issue)

Relevant Logs

No response

Custom Helm Values

    AIRFLOW__CORE__PARALLELISM: "120"
    AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: "40"
    AIRFLOW__CELERY__WORKER_CONCURRENCY: "20"
    AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT: "600"
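
For context, in this chart such settings are normally passed to Airflow as environment variables through the airflow.config map of the release values. The following is a minimal sketch of how the variables above would typically be nested; the surrounding airflow: and config: keys are an assumption, since only the variables themselves were posted:

    airflow:
      config:
        # Airflow settings supplied as environment variables (values quoted as strings)
        AIRFLOW__CORE__PARALLELISM: "120"
        AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: "40"
        AIRFLOW__CELERY__WORKER_CONCURRENCY: "20"
        AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT: "600"
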
anu251989 commented 1 year ago

I have enabled a celery inspect command as a liveness probe to check Celery health. Whenever the Airflow Redis pod restarts, the worker pods disconnect from Redis and stop processing any messages; this probe checks the Celery status and restarts the worker pod. The probe was configured along these lines (the snippet in the original comment is truncated): livenessProbe: exec: command:
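
Since the probe snippet above is truncated, here is a rough illustration only, not the chart's actual template, of what a worker liveness probe based on a celery inspect ping check could look like as a plain Kubernetes container spec. The Celery app path assumes Airflow 2.x's default airflow.executors.celery_executor.app, and the timing values are placeholders:

    livenessProbe:
      exec:
        command:
          - sh
          - -c
          # ping only this worker's own Celery node, so an unrelated slow worker cannot fail the probe
          - celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$(hostname)"
      initialDelaySeconds: 120
      periodSeconds: 60
      timeoutSeconds: 60
      failureThreshold: 5

A probe like this restarts any worker whose Celery process stops answering, which matches the behaviour described above: when Redis becomes unreachable the ping fails and the pod is restarted rather than reconnecting in place.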

I disabled the Celery health checks and the worker memory utilization came down. I still need to find out the reason for the scheduler's utilization.

How can the Celery issue between Redis and the workers be fixed? With the Celery health checks disabled, the memory leak issue is resolved, but if a worker disconnects from Redis it stays idle without processing any messages.

https://github.com/airflow-helm/charts/issues/600

anu251989 commented 1 year ago

We upgraded to Airflow 2.4.3. When the Airflow Redis pod was killed, the worker pods missed the heartbeat and then stopped processing any tasks, staying idle. Redis and the workers should resume the connection once Redis is back up.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.


Issues never become stale if any of the following is true:

  1. they are added to a Project
  2. they are added to a Milestone
  3. they have the lifecycle/frozen label
thesuperzapper commented 1 year ago

@anu251989 are you still having this issue?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.


Issues never become stale if any of the following is true:

  1. they are added to a Project
  2. they are added to a Milestone
  3. they have the lifecycle/frozen label
albertjmr commented 7 months ago

Hello @thesuperzapper, I'm noticing an issue similar to the one described here.

I'll be doing some troubleshooting in the next couple of days but figured I should post in here too for visibility.

I'll start by adding a few extra tools to my custom airflow image to see what's consuming memory.

thesuperzapper commented 7 months ago

@albertjmr sounds good, please share any information you have.