apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Scheduler pods memory leak on airflow 2.3.2 #27589

Closed: bharatk-meesho closed this issue 1 year ago

bharatk-meesho commented 1 year ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

I am using Airflow 2.3.2 on EKS 1.22. The Airflow service on EKS was launched from the official Helm chart with minor modifications to replicas and resources. The memory utilization of the scheduler pod keeps increasing as the pod ages, and this is observed even when all DAGs are paused and nothing is running on Airflow. Several Airflow services were spun up in the EKS cluster from different configurations of the official Helm chart, and the issue was observed across them.

What you think should happen instead

This shouldn't happen; memory usage should stay roughly constant. Because autoscaling was configured on memory utilization, the growing memory caused the replica count to increase over a few days, even though nothing changed in the environment and it was not serving any traffic.

How to reproduce

Should be reproducible by deploying Airflow 2.3.2 with the official Helm chart on AWS EKS 1.22.

Operating System

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

bharatk-meesho commented 1 year ago

One more thing I observed: checking the memory utilization of the processes running in the pod with the top command, no process used more than 0.5-2% of memory. I have a screenshot but don't see any option to attach it here.

Taragolis commented 1 year ago

Hey @bharatk-meesho, are you sure that the "memory leaking" is not related to one of these issues/PRs?


I have a screen-shot but don't see any option to attach it here.

Attach files by dragging & dropping, selecting or pasting them.

bharatk-meesho commented 1 year ago

Thanks @Taragolis, I will take a look at these links. However, I did try one thing: I cleared the logs in /opt/airflow/scheduler and that didn't decrease the memory consumption. I also checked which metric the Grafana dashboard uses for memory utilization, and it is "container_memory_working_set_bytes".
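
(For reference, a minimal way to watch the same metric outside Grafana, assuming a Prometheus endpoint reachable at $PROM and scheduler pods named airflow-scheduler-*; both names are assumptions, not taken from this deployment:)

# query the cAdvisor metric the dashboard plots, straight from Prometheus
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=container_memory_working_set_bytes{pod=~"airflow-scheduler.*", container!=""}'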

Below is a graph from Grafana of the scheduler pod's memory increasing, although all DAGs are paused and there have been no changes in the system whatsoever.

Screenshot 2022-11-11 at 7 28 09 PM

Another weird thing I noticed is that the combined memory utilization of all processes in top doesn't match what I get from kubectl. Both screenshots are attached below.

The HPA command output shows about 55% memory utilization, and from what I have observed so far it keeps increasing.

Screenshot 2022-11-11 at 7 40 27 PM

While via top, after SSHing into the scheduler pod, it doesn't seem to be more than 2%.

Screenshot 2022-11-11 at 7 43 51 PM
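
(For reference, a hedged way to put those two views side by side; the namespace and pod name below are placeholders, and ps is only available here because procps is installed in the image:)

# pod-level view, the same number the HPA / metrics-server sees
kubectl -n airflow top pod <scheduler-pod>

# process-level view inside that pod, sorted by resident set size
kubectl -n airflow exec <scheduler-pod> -- ps aux --sort=-rss | head -n 15
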
Taragolis commented 1 year ago

Also, it would be nice to know what type of memory is "leaked"; it might be some caches (you can find info in the other issues).

And does it cause OOM?

potiuk commented 1 year ago

The container_memory_working_set_bytes metric should be non-evictable memory, so I think this is not a cache.
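
(As a rough cross-check, the working set cAdvisor reports is approximately the cgroup's total usage minus the inactive file cache. A sketch of that arithmetic inside the pod, assuming cgroup v1 paths, which may differ on other node images:)

# break the container's cgroup memory down by type
kubectl -n airflow exec <scheduler-pod> -- sh -c '
  usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
  inactive=$(awk "/^total_inactive_file/ {print \$2}" /sys/fs/cgroup/memory/memory.stat)
  echo "usage:          $usage"
  echo "inactive file:  $inactive"
  echo "working set ~=  $((usage - inactive))"
  grep -E "^(total_rss|total_cache) " /sys/fs/cgroup/memory/memory.stat'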

I think it will cause OOM, but I also think this comes from one of your modifications - Airflow itself does not seem to be using more memory.

I recommend installing completely "vanilla" Airflow - Helm chart and image - and seeing if you observe the same growth. If not (which I expect), you can apply your modifications one by one and see which one causes the problem.

This is the best way I can advise for diagnosing this kind of issue. It's extremely hard to know what it is without applying such a technique. It can be anything - scripts running in the background, for example. The fact that you do not see it in Airflow's container would suggest that it might be another container - an init container, or maybe even a liveness probe running and leaving something behind.
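
(A per-container breakdown can help narrow that down; the --containers flag is standard kubectl top, while the namespace and label selector below are assumptions based on the official chart's labels:)

# show memory per container inside the scheduler pod, not just the pod total
kubectl -n airflow top pod --containers -l component=scheduler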

bharatk-meesho commented 1 year ago

I checked the memory stats on vanilla Airflow (directly used the image provided by Airflow for 2.3.2, without any modifications) and still saw memory increasing.

Going to try with 2.4.2 to see if this is fixed there.

potiuk commented 1 year ago

Do you know which process eats memory there? Are you using a completely standard deployment with the completely standard Helm chart, or do you have some modifications of your own?

bharatk-meesho commented 1 year ago

@potiuk how do I know which process eats memory? As shared above in the top output, it doesn't show where the memory is being utilized.

I am changing a few things from the official deployment process; my deployment process is below.

1) Below is my Dockerfile:

# start from the official 2.3.2 image
FROM apache/airflow:2.3.2-python3.8

# switch to root to install extra system packages (build tools plus net-tools/procps for troubleshooting)
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    gcc g++ python-dev libsasl2-dev net-tools procps

I am using some libraries for extra functionality and for troubleshooting (since things like top are not installed in the pod by default).

2) The image in (1) is pushed to ECR, which I point to in the Helm chart. The only other changes in the Helm chart concern the resources given to the pods and their replication (a sketch of this kind of override is shown after this list).

3) I use the Argo CD tool to deploy the Helm chart (this setup is provided by the DevOps team in my org).

Please let me know if there are any other details I should share. I feel I am not modifying much from the official Helm chart/image, so these memory issues shouldn't be happening.
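
(For illustration only - a hedged sketch of the kind of overrides described in (1) and (2); the release name, ECR repository and resource values are placeholders, not taken from this deployment:)

# point the official chart at a custom image in ECR and override scheduler resources/replicas
helm upgrade --install airflow apache-airflow/airflow -n airflow \
  --set images.airflow.repository=<account-id>.dkr.ecr.<region>.amazonaws.com/airflow \
  --set images.airflow.tag=2.3.2-custom \
  --set scheduler.replicas=2 \
  --set scheduler.resources.requests.memory=1Gi \
  --set scheduler.resources.limits.memory=2Gi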

potiuk commented 1 year ago

Do you see the same leaks WITHOUT modifying anything from basic Airflow? I do not know whether your modifications caused it - but comparing against a "baseline" might give a hint. The whole point about debugging such problems is that they might be caused by changes that do not look suspicious, and the BEST way of debugging them is something called bisecting - which is a valid debugging technique:

1) You have a base system with no modification at all and no memory leak.
2) You have a system with some modifications and a memory leak.

Now you iterate over the changes you made until you find the one single change that causes the leak. This is usually the fastest and most effective way of finding the root cause. If there is no obvious reason, this is the ONLY way. Even if you feel it's "not much", I've seen totally unexpected things happen with an "innocuous" change.

@potiuk how do I know which process eats memory?

If I knew a straightforward answer, I would give it to you. I usually use top, or better htop, to observe what's going on, and then I try to dig deeper if I see anything suspicious. But this has its limits due to the complex nature of memory usage on Linux.

I am afraid I cannot give a simple answer to "how to check memory" - depending on which memory you observe leaking, there are several different places to look, but none of them have simple recipes. Simply because memory in Linux is an extremely complex subject - much more complex than you might think.

There are various ways applications and the kernel use memory, and a simple "do that" solution does not exist. If you search for how to do it, you will find plenty of "how you can approach it" guides - the first example I found is https://www.linuxfoundation.org/blog/blog/classic-sysadmin-linux-101-5-commands-for-checking-memory-usage-in-linux - you will not get direct answers, but you will get a few tools you can try, to see if any of them gives you some of the answer.
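
(The kind of commands such guides walk through are standard and can be run as-is inside the scheduler container, assuming procps-style tools are installed there:)

free -m               # overall used / free / buff-cache
cat /proc/meminfo     # detailed kernel view: MemAvailable, Slab, and friends
vmstat -s             # memory counters in a single snapshot
top                   # per-process RES / %MEM, interactive
htop                  # like top, friendlier UI (needs to be installed separately)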

In many cases kernel memory usage will grow but it won't be attributed to any single process (even if it originated from a single process). At other times the memory used by processes will appear partially duplicated, because they share "copy on write" memory from when the processes forked and most of it is still shared.

Observing the various memory values in htop (suggested over top) should give you some clues. But it can also be the kernel that is leaking memory, and you will not see that there - https://unix.stackexchange.com/questions/97261/how-much-ram-does-the-kernel-use has some other debugging techniques for seeing it.
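
(A couple of hedged starting points for the kernel-side view; slabtop needs root privileges and is not present in every container image:)

grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo   # kernel slab allocator totals
slabtop -o                                                # one-shot listing of the largest slab caches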

Most likely, if you do not see any process leaking memory, then it is the kernel leaking it - which might mean many things, including, for example, your K8S instance having a shared volume with a buggy library (which would be nothing Airflow is aware of). Or even the monitoring software (e.g. a Grafana agent) might cause it. Hard to say, and I am afraid I cannot help more than "try to pin-point the root cause".

This is also why pin-pointing is very important and often the fastest way to debug stuff. No one will be able to "guess" what it is by just looking at the modifications, but getting it down to a single change causing the leak might help in getting closer to where to look for it.

potiuk commented 1 year ago

Another option for pin-pointing is to selectively disable certain processes and compare the usage before/after. For example, if you see a pod running with multiple processes in it, you can eliminate certain processes in some containers - changing a container's entrypoint command to "sleep 3600" replaces whatever was supposed to run with something that certainly does not take memory - and that way you can see which process caused it.
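
(A minimal sketch of that trick on a Deployment; the namespace, Deployment name and container index are placeholders, not taken from this chart:)

# replace the first container's command with a no-op so the rest of the pod keeps running
kubectl -n airflow patch deployment airflow-scheduler --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sleep", "3600"]}]'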

On top of that - again, switching Airflow back to the original configuration and "vanilla" state might tell you, for example, that your configuration is the problem, and re-applying the configuration piece by piece (including logging handlers, default values for hostname checks and many others) might help with pin-pointing. It's almost certain Airflow in the vanilla state has no such leak - it would be far too easy to see - so it must be something on your side. The growth you show is pretty catastrophic; it would force most Airflow installations to restart the scheduler every day or so - which does not happen.

I also suggest (if you get to vanilla and the memory is still growing) testing different Airflow versions - maybe what you see is a mistake - and trying various versions might simply give more answers. And finally, if you see it in several Airflow versions, I would try other experiments - replacing the scheduler with other components, etc. Unfortunately I cannot have access to your system to play with it, but if I were you, this is what I'd do.

BobDu commented 1 year ago

I have the same problem on Airflow 2.4.2, deployed with the official Helm chart on AWS EKS 1.20 & 1.21, using the official Docker image apache/airflow:2.4.2-python3.10.

But I deploy a standalone dag processor, and the memory leak happens on both the scheduler and the dag processor.

(two screenshots attached)

bharatk-meesho commented 1 year ago

I am still not sure what was wrong, but moving to Airflow 2.4.2 solved my issue. @BobDu maybe you can also try with 2.4.2.

potiuk commented 1 year ago

Yes, I suggest upgrading to the latest version and seeing if it still happens. I also think the overall memory usage metric alone is not enough. I read a bit about the subject and the problem is far from simple.

Kubernetes not only runs your application but also runs monitoring and tweaks memory use for pods via the kubelet; additionally, when you run monitoring/Prometheus, the agent inside every pod might impact the memory used by caching some stuff.

You can read a lot about it here for example https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/227

Even if you use Grafana, Grafana itself can cause the increase - depending on the version. Example here: https://github.com/grafana/loki/issues/5569

I think there is not much we can do in Airflow with WSS reporting showing those numbers, unless someone can dig deeper and pin-point the memory usage to Airflow rather than to other components (especially monitoring impacting the memory usage).

I suggest upgrading to the latest versions of everything you have (k8s, Grafana, Prometheus, Airflow) and trying again.

BobDu commented 1 year ago

(screenshot attached)

RSS memory continues to increase, the same as before. I am already using 2.4.2, @bharatk-meesho. Thanks for everyone's help.

I will continue to investigate this issue, happy to share any progress.
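
(One low-effort way to keep watching this, assuming ps is available in the image; the pod name and interval are placeholders:)

# sample per-process RSS inside the scheduler pod every minute and keep a local log
while true; do
  date
  kubectl -n airflow exec <scheduler-pod> -- ps -eo pid,rss,etime,cmd --sort=-rss | head -n 10
  sleep 60
done | tee -a scheduler-rss.log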

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions[bot] commented 1 year ago

This issue has been closed because it has not received response from the issue author.

Kirgod commented 1 year ago

@BobDu any luck with the investigation? I have the same issue with Airflow version 2.6.1 :\