apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Scheduler Memory Leak in Airflow 2.0.1 #14924

Closed · itispankajsingh closed 3 years ago

itispankajsingh commented 3 years ago

Apache Airflow version: 2.0.1

Kubernetes version (if you are using kubernetes) (use kubectl version): v1.17.4

Environment: Dev

What happened:

After running fine for some time, my Airflow tasks got stuck in the scheduled state with the below error in Task Instance Details: "All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless: - The scheduler is down or under heavy load If this task instance does not start soon please contact your Airflow administrator for assistance."

What you expected to happen:

I restarted the scheduler and then it started working fine. When I checked my metrics I realized the scheduler has a memory leak; over the past 4 days it has reached up to 6 GB of memory utilization.

In versions >2.0 we don't even have the run_duration config option to restart the scheduler periodically as a workaround until this issue is resolved.

How to reproduce it: I saw this issue in multiple dev instances of mine, all running Airflow 2.0.1 on Kubernetes with the KubernetesExecutor. Below are the configs that I changed from the defaults: max_active_dag_runs_per_dag=32, parallelism=64, dag_concurrency=32, sql_alchemy_pool_size=50, sql_alchemy_max_overflow=30

Anything else we need to know:

The scheduler memory leak occurs consistently in all instances I have been running. The scheduler's memory utilization keeps growing.

suhanovv commented 3 years ago

@potiuk I will be able to check this in the late afternoon or tomorrow, since we had maintenance work on the cluster and had to restart the container, so right now it has no cache.

I am fairly sure this is cache; I have added container_memory_cache to the chart:

[image: container_memory_cache chart]

potiuk commented 3 years ago

Ah cool. So at least we figured that one out. Then it should be no problem whatsoever. One thing we COULD do is potentially add a hint to the kernel not to add the log files to the cache, if this is page cache. Growing cache does no harm in general, but adding the hint might actually save us (and our users!) from diagnosing and investigating issues like this ;)

potiuk commented 3 years ago

@suhanovv -> this is the change you can try: https://github.com/apache/airflow/pull/18054. While the ever-growing cache is not a problem, by implementing this advice to the kernel we can possibly avoid the cache growing in the first place.
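(For context, the kernel hint being discussed is posix_fadvise with POSIX_FADV_DONTNEED, which Python exposes as os.posix_fadvise. A minimal, illustrative sketch of the idea follows; it is not the actual code from #18054, and the function name and path in the usage comment are made up for the example.)

```python
import os

def append_log_without_caching(path: str, data: bytes) -> None:
    """Append log data, then hint the kernel to drop those pages from the page cache."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # dirty pages must be written back before the advice can evict them
        # POSIX_FADV_DONTNEED: these pages will not be read again soon, so the kernel
        # may evict them immediately instead of letting the page cache keep growing.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

# Example (hypothetical path): append_log_without_caching("/tmp/task.log", b"task finished\n")
```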

lixiaoyong12 commented 3 years ago

We have deployed the scheduler today and the memory has increased from 100 MB to 220 MB.

> @lixiaoyong12 - what kind of memory are you talking about? Is it container_memory_working_set_bytes or container_memory_cache?

I deployed the scheduler directly on the Linux operating system.

lixiaoyong12 commented 3 years ago

> So I guess the quest continues. Hmm. Interesting one that it went down indeed after some time. If that's the cache, then it would be strange for it to show up in container_memory_working_set_bytes (I presume the graph above shows that metric?).
>
> I have another hypothesis. The Linux kernel also has "dentry" and "inode" caches - it keeps in memory the used/opened directory structure and file node information. And I believe those caches would also be cleared whenever the log files are deleted.
>
> If this is a cache, you can very easily check it - you can force cleaning the cache and see the results:
>
> Cleaning just the page cache:
>
> sync; echo 1 > /proc/sys/vm/drop_caches
>
> Cleaning dentries and inodes:
>
> sync; echo 2 > /proc/sys/vm/drop_caches
>
> Can you make such an experiment please?

After sync; echo 1 > /proc/sys/vm/drop_caches, it's down about 40 MB, and there's still more than 200 MB.
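(For anyone repeating this check, here is a minimal sketch of scripting the before/after measurement; it needs root, reads /proc/meminfo, and does the same thing as the sync; echo 1 > /proc/sys/vm/drop_caches command quoted above.)

```python
import os

def meminfo_kb(field: str) -> int:
    """Read a single field (e.g. 'Cached' or 'MemFree') from /proc/meminfo, in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

def drop_page_cache() -> None:
    """Equivalent of: sync; echo 1 > /proc/sys/vm/drop_caches (requires root)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("1\n")

before = meminfo_kb("Cached")
drop_page_cache()
print(f"Page cache: {before} kB -> {meminfo_kb('Cached')} kB")
```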

potiuk commented 3 years ago

> I deployed the scheduler directly on the Linux operating system.

Still - you can see whether it's process or cache memory that grows:

For example here you can see how to check different types of memory used: https://phoenixnap.com/kb/linux-commands-check-memory-usage

Could you check what kind of memory is growing ?

lixiaoyong12 commented 3 years ago

> We have deployed the scheduler today and the memory has increased from 100 MB to 220 MB.
>
> @lixiaoyong12 - what kind of memory are you talking about? Is it container_memory_working_set_bytes or container_memory_cache?
>
> I deployed the scheduler directly on the Linux operating system.

I ran ps auxww | grep airflow at different times and found that the memory increased from 100 MB to 220 MB.
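(A minimal sketch of automating that sampling on Linux instead of running ps by hand; it reads VmRSS from /proc, and the PID below is a placeholder for the scheduler process.)

```python
import time

def rss_kb(pid: int) -> int:
    """Resident set size of a process in kB, taken from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError(f"no VmRSS entry for pid {pid}")

SCHEDULER_PID = 12345  # placeholder: the scheduler PID on the affected host
while True:  # stop with Ctrl-C
    print(time.strftime("%H:%M:%S"), rss_kb(SCHEDULER_PID), "kB")
    time.sleep(300)  # one sample every 5 minutes
```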

lixiaoyong12 commented 3 years ago

> I deployed the scheduler directly on the Linux operating system.
>
> Still - you can see whether it's process or cache memory that grows:
>
> For example here you can see how to check different types of memory used: https://phoenixnap.com/kb/linux-commands-check-memory-usage
>
> Could you check what kind of memory is growing?

I used pmap -p 203557 | grep anon and found that the mapping 00007efdc9d0d000 115968K rw--- [ anon ] is the one that grows.

potiuk commented 3 years ago

Can you please dump a few pmap outputs at different times and share them as a .tar.gz or something, @lixiaoyong12? Without grep, so that we can see everything. Ideally over a timespan of a few hours, so that we can see the trend and confirm this is not a "temporary" fluctuation.
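(A small sketch of how those periodic dumps could be collected; it assumes the pmap tool from procps is installed, and the PID and output directory are placeholders. The resulting directory can then be tarred and attached.)

```python
import subprocess
import time
from pathlib import Path

SCHEDULER_PID = 203557          # placeholder: the scheduler PID mentioned in the thread
OUT_DIR = Path("pmap_dumps")    # tar.gz this directory once the run is finished
OUT_DIR.mkdir(exist_ok=True)

for _ in range(24):  # one full (un-grepped) dump every 15 minutes, for about 6 hours
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dump = subprocess.run(
        ["pmap", "-x", str(SCHEDULER_PID)],
        capture_output=True, text=True, check=True,
    ).stdout
    (OUT_DIR / f"pmap-{stamp}.txt").write_text(dump)
    time.sleep(15 * 60)
```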

potiuk commented 3 years ago

Just to explain @lixiaoyong12 -> when you have a number of different dags and schedules, I think - depending on frequency etc. - it would be perfectly normal for the scheduler to use more memory over time initially. Generally speaking it should stabilize after some time and then fluctuate up/down depending on what is happening. That's why I want to make sure this is not such a fluctuation; also, if you could periodically run the cache cleanup and see whether the memory returns to more-or-less the same value after some time, that would be most helpful!

potiuk commented 3 years ago

I updated the fix in #18054 (hopefully it will be ok now) @suhanovv - in case you would like to try it. I will wait for it to pass the tests, but hopefully it will be ok now (I had mixed up os.open with open 🤦).

suhanovv commented 3 years ago

@potiuk Ok, we will deploy to the test stand today

suhanovv commented 3 years ago

@potiuk the last fix works as it should

[image: memory chart after the fix]

potiuk commented 3 years ago

🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉

potiuk commented 3 years ago

Thanks a lot ! That might really help with user confusion!

sawaca96 commented 2 years ago

@potiuk

I use helm 1.6.0 and airflow 2.2.5

[images: scheduler and triggerer memory charts]

Why does memory continuously increase? It happens for both the scheduler and the triggerer, but not the webserver.

potiuk commented 2 years ago

What kind of memory is it? See the whole thread. There are different kinds of memory, and you might be observing cache memory growth for whatever reason.

Depending on the type of memory, it might or might not be a problem. But you need to investigate it in detail. No one is able to diagnose it for you based on this thread alone.

The thread has all the relevant information. You need to see what process is leaking - whether it is Airflow, the system, or some other process.

BTW, I suggest you open a new discussion with all the details. There is little value in commenting on a closed issue. Remember also that this is a free forum where people help when they can, and their help is much more effective if you give all the information and show that you've done your part. There are also companies offering paid support for Airflow, and they can likely do the investigation for you.