apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Scheduler Memory Leak in Airflow 2.0.1 #14924

Closed · itispankajsingh closed this issue 3 years ago

itispankajsingh commented 3 years ago

Apache Airflow version: 2.0.1

Kubernetes version (if you are using kubernetes) (use kubectl version): v1.17.4

Environment: Dev

What happened:

After running fine for some time, my Airflow tasks got stuck in the scheduled state with the below error in Task Instance Details: "All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless: - The scheduler is down or under heavy load If this task instance does not start soon please contact your Airflow administrator for assistance."

What you expected to happen:

I restarted the scheduler and then it started working fine. When I checked my metrics, I realized the scheduler has a memory leak: over the past 4 days it had reached up to 6 GB of memory utilization.

In 2.0+ we don't even have the run_duration config option to restart the scheduler periodically as a workaround until this issue is resolved.

How to reproduce it: I saw this issue in multiple dev instances of mine, all running Airflow 2.0.1 on Kubernetes with the KubernetesExecutor. Below are the configs I changed from the defaults:

max_active_dag_runs_per_dag=32
parallelism=64
dag_concurrency=32
sql_alchemy_pool_size=50
sql_alchemy_max_overflow=30

Anything else we need to know:

The scheduler memory leak occurs consistently in all instances I have been running; the scheduler's memory utilization keeps growing.

boring-cyborg[bot] commented 3 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

xsu-git commented 3 years ago

I have the same problem. I looked through all the channels and methods but could not solve it. Thanks for raising this question - please let me know what the problem turns out to be. I have to use a crontab script to restart the scheduler process regularly, but that is an ugly workaround. If it can't be solved, I can only fall back to 1.10.

itispankajsingh commented 3 years ago

This issue is also present in version 1.10; however, in version 2.0 the issue is more severe, and we also don't have the run_duration option, hence we have to deploy our own cron jobs to restart the scheduler regularly. https://issues.apache.org/jira/browse/AIRFLOW-4593

raphaelauv commented 3 years ago

Until there is a fix or you find a specific reason, you could handle the OOM -> https://github.com/kubernetes/kubernetes/issues/40157

There are a few initiatives :+1:

xsu-git commented 3 years ago

Thanks for your comments. It seems the memory leak problem has not been solved yet; I can only use a cron script to restart the process every hour.

uranusjr commented 3 years ago

It seems like this is specific to the Kubernetes executor? It’d be awesome if you can confirm.

xsu-git commented 3 years ago

Not completely. I use the CeleryExecutor in my project and also have this problem: after scheduling, the scheduler's VSZ and RSS keep growing.

itispankajsingh commented 3 years ago

One observation I have is that the rate of the memory leak increases with the number of DAGs (irrespective of whether they are being run). It definitely has something to do with the DAG parsing process.
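
For anyone who wants to put numbers on this, here is a rough sketch (assuming psutil is installed and that you know the scheduler's PID - both are assumptions on my part) that samples the scheduler's RSS and its child-process count over time, so the growth rate can be compared between environments with different DAG counts:

import time

import psutil

def sample_scheduler_memory(pid: int, interval_s: int = 60, samples: int = 60) -> None:
    """Print the scheduler's RSS and child-process count at a fixed interval."""
    proc = psutil.Process(pid)
    for _ in range(samples):
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        # DAG file processor subprocesses show up as children of the scheduler
        children = len(proc.children(recursive=True))
        print(f"{time.strftime('%H:%M:%S')} rss={rss_mb:.1f}MB child_procs={children}")
        time.sleep(interval_s)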

xsu-git commented 3 years ago

Yep. In my production environment, with a small number of jobs no problems show up for a while, but as more business gets connected, the increase in the amount of work causes the scheduler process to die.

xsu-git commented 3 years ago

Guys, I also have some other questions. In my project, a BashOperator task occasionally gets the exception 'Bash command failed. The command returned a non-zero exit code.', but it succeeds after retrying, and there is nothing useful in the log file. I don't know what the reason is. If you have experience with this, please let me know - thank you very much.

uranusjr commented 3 years ago

Please keep this thread on topic with the scheduler memory issue. For usage questions, please open threads in Discussions instead.

xsu-git commented 3 years ago

In my production env:

linux release: Debian
airflow version: 2.0.1
executor = CeleryExecutor
max_active_dag_runs_per_dag=32
parallelism=32
dag_concurrency=16
sql_alchemy_pool_size=16
sql_alchemy_max_overflow=16

About 3 workers, 40 DAGs, 1000 tasks. Sometimes many tasks stay in the scheduled state and cannot start running. Since I set up a cron script to restart the process every hour, the problem has been worked around.

maison2710 commented 3 years ago

I also got a similar issue with Airflow 2.0.1 when using the Kubernetes executor. Is there any update or timeline for this issue?

brunoffaustino commented 3 years ago

We are also facing this issue right now. Any news? We currently use the CeleryExecutor, like @suxin1995. cc: @Cabeda @pmamarques

CapBananoid commented 3 years ago

Hi, if it can help: I have this kind of leak on 2.1.2 with the CeleryExecutor on Kubernetes. The scheduler's memory increased by approximately 400 MB every day, and I noticed a drop (from 2 GB to 500 MB) as soon as I deleted old logs in /opt/airflow/logs/scheduler (I deleted all folders except the one pointed to by "latest"). I had 2 GB of logs and am now back to 300 MB, and I didn't need to restart the scheduler for that drop to occur. I don't know if it's really related, but I think I will add a CronJob container on this volume to do this periodically.
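
Something like this rough sketch is what I have in mind for the cleanup job (assumptions on my part: the scheduler logs live in per-date folders under /opt/airflow/logs/scheduler and "latest" is a symlink - this is not the official clean-logs.sh script):

import shutil
from datetime import datetime, timedelta
from pathlib import Path

LOG_ROOT = Path("/opt/airflow/logs/scheduler")
RETENTION_DAYS = 3

def clean_old_scheduler_logs() -> None:
    """Delete per-date scheduler log folders older than RETENTION_DAYS."""
    cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)
    latest = LOG_ROOT / "latest"
    current = latest.resolve() if latest.exists() else None
    for folder in LOG_ROOT.iterdir():
        # Keep the "latest" symlink and whatever folder it points to
        if not folder.is_dir() or folder.is_symlink() or folder.resolve() == current:
            continue
        try:
            folder_date = datetime.strptime(folder.name, "%Y-%m-%d")
        except ValueError:
            continue  # not a per-date folder, leave it alone
        if folder_date < cutoff:
            shutil.rmtree(folder)

if __name__ == "__main__":
    clean_old_scheduler_logs()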

potiuk commented 3 years ago

That is a cool finding.

suhanovv commented 3 years ago

We have the same problem: Airflow 2.1.3 in a k8s cluster with the CeleryExecutor.

15 days of logs give us (see screenshot) ~6 GB of memory usage, and ~1.7 GB after deleting 12 days of logs.

potiuk commented 3 years ago

A few questions.

Do you know which processes/containers keep the memory? Is it the scheduler (and which container)? Maybe you can also see the breakdown per process? I understand this is whole-cluster memory, and I am trying to wrap my head around it and see where it can come from, because it is super weird behaviour to get memory back after deleting files (?).

Do you simply run "rm *" in "/opt/airflow/logs/scheduler" and it drops immediately after? Or is there some delay involved? Do you do anything other than rm? I understand you do not restart the scheduler. Can you check whether maybe the scheduler restarted itself by coincidence (or triggered by the deletion)?

Maybe you can also see how many Airflow-related processes you have when the scheduler runs? And maybe their number grows and then drops when you delete the logs?

CapBananoid commented 3 years ago

I did nothing but an rm and it dropped almost immediately (sorry, the memory metric comes from Prometheus so there is some delay, but what I can tell you is that it dropped within 15 s after I did the rm). I just did it on my dev instance as well - same result. The container is the scheduler; I run a separate container for each service, and this is the one with the command "airflow scheduler -n -1". I don't think the scheduler restarted by itself: if it had, Kubernetes would have shown a failed container and restarted it, and that's not the case. I can also tell you that the type of memory that is shown to grow is the one from the Prometheus metric called container_memory_cache.

CapBananoid commented 3 years ago

Fun fact: if I create new log folders (doing "cp -R 2021-09-01 XXX" for example), then memory rises 30 s afterwards.

potiuk commented 3 years ago

Ah right. The last line you wrote (container_memory_cache) is GOLD.

That probably would explain it and it's NOT AN ISSUE.

When you open many files, Linux will basically use as much memory as it can for file caches. Whenever you read or write a file, the disk blocks are also kept in memory just in case the file needs to be accessed again by any process. It also marks them dirty when the blocks change and evicts such dirty blocks from memory. Also, when some process needs more memory than is available, the kernel will evict some unused pages from memory to free them. Basically, for any system that writes log files continuously and where the logs are not modified later, the cache memory will grow CONTINUOUSLY up to the limit set by the kernel configuration.

So depending on what your kernel configuration is (basically the kernel of your Kubernetes virtual machines under the hood), you will see the metric growing continuously (up to the kernel limit). You can limit the memory available to your scheduler container to cap it "per container" (by giving it fewer memory resources), but basically, however much memory you give to the scheduler container, it will be used for cache after some time (and will not be explicitly freed - but that's not a problem, because the memory is effectively "free": it's just used for cache and can be freed immediately when needed).

That would PERFECTLY explain why the memory drops immediately after the files are deleted - those files are deleted so the cache for those files should also get deleted by the system immediately.

Instead of looking at the total memory used, you should look at the container_memory_working_set_bytes metric. It reflects the actually "actively used" memory. You can read more here: https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66

You can also test it by running (from https://linuxhint.com/clear_cache_linux/):

echo 1 > /proc/sys/vm/drop_caches

Run it in the container; this should drop your caches immediately without deleting the files.
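
If you want to see the numbers from inside the container, here is a minimal sketch (it assumes a Linux /proc filesystem is visible inside the container) that reads the "Cached" field of /proc/meminfo; running it before and after the drop_caches command should show the cached figure collapsing without any files being deleted:

def page_cache_mb() -> float:
    """Return the size of the kernel page cache in MB, read from /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Cached:"):
                return int(line.split()[1]) / 1024  # /proc/meminfo reports kB
    return 0.0

print(f"page cache: {page_cache_mb():.0f} MB")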

potiuk commented 3 years ago

Actually, one thing that might even help to keep the "cache" memory down (though it has barely any consequences): do you happen to run any kind of automated log rotation? We have a "clean-logs.sh" script in the official image that can be run to clean the logs. This will have a side effect of freeing the page cache memory used by those files: https://github.com/apache/airflow/blob/main/scripts/in_container/prod/clean-logs.sh

CapBananoid commented 3 years ago

I can easily set that up to launch as a CronJob, yes. But even if I understand the cache thing, I don't get why it would cache files it doesn't even need to look at (when I create a dummy folder in the logs folder). I will also try putting limits on the scheduler container to see if that does the trick. A strange thing is that this memory increase is not seen with the web service. And finally, are you saying that it doesn't relate to the memory leak topic - is that it? Or maybe the memory leak is a false flag for the freeze @itispankajsingh was concerned about.

potiuk commented 3 years ago

I don't get why it would cache files it doesn't even need to look at (when I create a dummy folder in the logs folder)

Well, you do create cache entries at the moment you WRITE the file (and when it is flushed to disk). This is simply how the Linux page cache works. You can specifically prevent a file from being kept in the cache when you save it, but this is a low-level API and very few systems do it, because it has almost no effect besides lowering the metrics. See the discussion under "Optimizing Page Cache" in: https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics

Basically, this happens at the kernel level, and the kernel does not KNOW (unless it is explicitly told) whether the file will be read after it is written. So, just in case, it will cache it (because it is essentially cost-free and might bring a lot of benefits - like 2 s instead of 10 s to read a file that has been written recently). You might argue whether that is a good decision or not, but this is how it is (and I think it is a good design choice for performance).
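
For illustration, this is roughly what that low-level API looks like from Python (just a sketch, assuming os.posix_fadvise is available, i.e. Linux - Airflow itself does not do this): after writing a file, the process can hint to the kernel that the freshly written pages are not needed, so they get evicted from the page cache:

import os

def write_without_keeping_cache(path: str, data: bytes) -> None:
    """Write a file and advise the kernel to drop its pages from the page cache."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, data)
        os.fsync(fd)  # pages must be flushed (clean) before they can be dropped
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)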

A strange thing is that this memory increase is not seen with the web service

With the webserver, you likely already reached the memory limits and the whole available memory is used for cache. This is pretty much what happens on any long-running system that creates or reads a lot of files. Maybe you can compare the limits you have there with the reported memory used.

And finally, are you saying that it doesn't relate to the memory leak topic - is that it? Or maybe the memory leak is a false flag for the freeze @itispankajsingh was concerned about.

I do not know that. I am just saying that IF you see container_memory_cache growing, this is pretty normal and expected, and container_memory_working_set_bytes is what you should look at instead if you want to see whether there is a memory leak.

suhanovv commented 3 years ago

My screenshot shows the metric container_memory_working_set_bytes; it decreased 10 seconds (the metrics scraping interval) after the rm execution, and the scheduler did not restart.

potiuk commented 3 years ago

My screenshot shows the metric container_memory_working_set_bytes; it decreased 10 seconds (the metrics scraping interval) after the rm execution, and the scheduler did not restart.

Ah, so the investigation begins again then. @suhanovv, I need some more answers :).

Just to clarify: so the 4 GB memory drop (out of 6 GB) was all from container_memory_working_set_bytes?

Do you know which processes they were? Do you have any other processes running in the container besides Airflow? Was there a drop in the number of processes when you deleted the files?

Maybe you can even look now and show the memory usage of the processes you currently have in the container, after the scheduler has run for some time since the deletion.

I'd love to get to the bottom of it, because I find it really surprising that removing files causes a memory drop. I think - besides the kernel cache, which is low level - you'd really need some kind of service subscribed to those files via an FS "notify" kind of mechanism to be able to free any memory as a result of deleting a file.

Normally, if you have a file open in Linux and the file gets deleted, nothing special happens; the file is not actually removed until the last process that keeps it open closes it. So I find it really surprising to see such behaviour (that's why it is so interesting - because it is counter-intuitive).
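
A quick throwaway sketch to illustrate that point (standard library only, nothing Airflow-specific):

import os
import tempfile

tmp = tempfile.NamedTemporaryFile(mode="w", delete=False)
tmp.write("hello")
tmp.close()

f = open(tmp.name)   # keep a descriptor to the file open
os.unlink(tmp.name)  # "delete" the file; only the directory entry is gone...
print(f.read())      # ...the content is still readable through the open descriptor
f.close()            # only now is the inode (and its cached pages) actually released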

Additional question: what KIND of filesystem do you have for the scheduler logs? Is it a usual "local filesystem" or some kind of distributed, user-space filesystem?

Because if it is the latter, then it could be the (for example, user-space) filesystem that keeps the log file data in memory and frees it after the files are deleted.

suhanovv commented 3 years ago

My screenshot shows the metric container_memory_working_set_bytes; it decreased 10 seconds (the metrics scraping interval) after the rm execution, and the scheduler did not restart.

Ah, so the investigation begins again then. @suhanovv, I need some more answers :).

Just to clarify: so the 4 GB memory drop (out of 6 GB) was all from container_memory_working_set_bytes?

[screenshot] Yes.

Do you know which processes they were? Do you have any other processes running in the container besides Airflow? Was there a drop in the number of processes when you deleted the files?

After deleting the files, the number of processes did not change.

Maybe you can even look now and show the memory usage of the processes you currently have in the container, after the scheduler has run for some time since the deletion.

[screenshot]

I'd love to get to the bottom of it, because I find it really surprising that removing files causes a memory drop. I think - besides the kernel cache, which is low level - you'd really need some kind of service subscribed to those files via an FS "notify" kind of mechanism to be able to free any memory as a result of deleting a file.

Normally, if you have a file open in Linux and the file gets deleted, nothing special happens; the file is not actually removed until the last process that keeps it open closes it. So I find it really surprising to see such behaviour (that's why it is so interesting - because it is counter-intuitive).

Additional question: what KIND of filesystem do you have for the scheduler logs? Is it a usual "local filesystem" or some kind of distributed, user-space filesystem?

Local FS.

Because if it is the latter, then it could be the (for example, user-space) filesystem that keeps the log file data in memory and frees it after the files are deleted.

potiuk commented 3 years ago

@suhanovv - Is container_memory_working_set_bytes still growing continuously now? Can you please observe it for a while and, when it grows, compare the memory used by the processes and see which one is taking the GBs of memory? I think knowing that would speed up any hypothesis/investigation and make it waaaay easier.

suhanovv commented 3 years ago

@potiuk

At the moment, container_memory_working_set_bytes has grown by about 150-200 MB. I will observe and try to investigate the reason for this behavior further; I will write up the results in a few days.

ashb commented 3 years ago

When I was digging into a similar issue, I couldn't see the memory attributed to any particular process -- only to the whole container via working_set_bytes. I was testing/looking in ps and all of the counters I could see in /proc/<pid>/, but didn't see any memory growth reflected in any of those.

So that led me to believe the problem was not a traditional memory leak from the Python code, but something OS-related.

suhanovv commented 3 years ago

@potiuk @ashb Firstly, since yesterday's deletion of the logs, memory consumption has grown by about 500 MB.

Second: my research led me to the following:

Since the processes did not consume more memory during this time, and there were no processes in /proc/*/fd holding descriptors to the old logs, I decided to look at what the container_memory_working_set_bytes metric actually shows. From the cAdvisor code, the working set memory is defined as: "The amount of working set memory, and it includes recently accessed memory, dirty memory, and kernel memory. Therefore, working set is (less than or equal to) usage." After that, I ran an experiment - removing another day of logs, with the output of vmstat captured before and after the deletion. As it turned out, container_memory_working_set_bytes decreased by about 0.7-0.8 GB, and in vmstat the system cache also decreased by 770 MB. [before/after screenshots]

That is, all these logs are stored in the cache, but it remains a mystery why they stay in the cache for so long. Our devops will also look from their side into why this can happen.

raphaelauv commented 3 years ago

Thanks @suhanovv for your investigations

Do you have the AWS S3 log export activated? (If that's the case, could you run a test with it deactivated?)

The last time I had "weird" (non-process-related) memory consumption, it was related to a "memory leak" caused by AWS boto3.

potiuk commented 3 years ago

Do you have the AWS S3 log export activated? (If that's the case, could you run a test with it deactivated?)

My thought exactly, feels like some 3rd-party process running and accessing the logs.

suhanovv commented 3 years ago

No, we don't use s3

potiuk commented 3 years ago

(BTW - is the first screenshot wrong, @suhanovv? It indicates "growing" memory rather than "shrinking" after the delete.)

suhanovv commented 3 years ago

Sorry, it was wrong - fixed.

suhanovv commented 3 years ago

I also came up with the idea of looking at stat output for the files, but unfortunately only after I had already deleted the logs - I will do it the next time I delete them, next week. At the moment, all logs for September 2 have similar file access times:

Access: 2021-09-02 00:00:11.232221400 +0000
Modify: 2021-09-02 23:59:41.722766166 +0000
Change: 2021-09-02 23:59:41.722766166 +0000

potiuk commented 3 years ago

I looked a bit more. I have a hypothesis about what the problem could be - I looked at the possibility that we have growing "dirty" memory (i.e. memory not flushed/synchronised to disk).

My hypothesis is that (at least some of) the logs are not flushed and remain in a "dirty" (or rather "unsynchronized") state. That would be fairly strange, as "dirty" memory is usually synchronized after at most a few seconds. And I believe that the standard "RotatingFileHandler" we use to write the processor manager logs should properly close and flush the streams anyway. But maybe you have some optimisations/settings on your OS that prolong/disable auto-flushing (that would be strange, though), or maybe there is some special log configuration/handler that you use?

I found this nice PDF describing how the page cache actually works (fascinating read): http://sylab-srv.cs.fiu.edu/lib/exe/fetch.php?media=paperclub:lkd3ch16.pdf - Linux uses a "write-back" cache strategy, where data is first written to the cache and later flushed to disk (and remains in the cache as non-dirty). Until then the OS marks the pages as dirty (and thus you can see them in container_memory_working_set_bytes). If for whatever reason those pages remained "dirty", they would still be counted as "working set" memory. And that would also explain why the memory is freed after deleting the files - the dirty pages are discarded when the file gets deleted.
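
A throwaway sketch to test that hypothesis (it assumes a Linux /proc filesystem): poll the system-wide "Dirty" counter while the scheduler is writing logs - if unsynchronized pages were accumulating, this value would keep climbing instead of dropping back after the kernel's periodic writeback:

import time

def dirty_kb() -> int:
    """Return the amount of dirty (not yet written back) memory in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])
    return 0

for _ in range(30):
    print(f"Dirty: {dirty_kb()} kB")
    time.sleep(10)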

suhanovv commented 3 years ago

As far as I know, no optimizations were made, and the problem is not with the logs of the dag processor manager but with the logs of DagFileProcessorProcess, where the default airflow.utils.log.file_processor_handler.FileProcessorHandler is used without any modifications or custom configuration.

At the moment, I'm not sure that the problem is in the "dirty" memory,

[screenshot]

but I will check tomorrow.

potiuk commented 3 years ago

OK, I'm not 100% sure this is the problem, but I think it's very probable, and I have a potential fix in #18012. I believe we were not calling the .close() method on the FileHandler when the FileProcessorHandler context changed (but I am not 100% sure that's what is happening). @ashb - care to take a look?
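
To make the idea concrete, here is only a sketch of the pattern (the class and method names are made up for illustration; this is not the actual code in #18012): when the log context switches to a new file, close the previous handler explicitly instead of relying on garbage collection to release the descriptor:

import logging

class SwappableFileLogger:
    """Hypothetical handler wrapper illustrating the explicit-close pattern."""

    def __init__(self):
        self.handler = None

    def set_context(self, log_path):
        # Explicitly close the previous handler (flushes and releases its file
        # descriptor) before replacing it, rather than waiting for the GC.
        if self.handler is not None:
            self.handler.close()
        self.handler = logging.FileHandler(log_path)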

potiuk commented 3 years ago

@suhanovv -> maybe you can apply the patch (maybe tomorrow) and test it?

suhanovv commented 3 years ago

@potiuk I was just planning to test a similar fix on our installation, since by examining the code I could not find where the file is being closed.

ashb commented 3 years ago

So I think that the file handler will be closed by the GC finalizer (since Python uses both ref counting and a periodic GC to detect loops), so assigning self.handler to something else should GC the old handler and the old FH.

Should. But it is entirely possible that the logging framework might have another reference to it somewhere hanging around.

In [2]: x = {'y': open("/etc/passwd")}
lrwx------ ash ash 64 B Fri Sep  3 14:48:12 2021  0 ⇒ /dev/pts/3
lr-x------ ash ash 64 B Fri Sep  3 14:49:06 2021  12 ⇒ /etc/passwd
x['y'] = open("/etc/group")

lrwx------ ash ash 64 B Fri Sep  3 14:48:12 2021  0 ⇒ /dev/pts/3
lr-x------ ash ash 64 B Fri Sep  3 14:49:26 2021  13 ⇒ /etc/group
potiuk commented 3 years ago

Hey @suhanovv - any results from the tests?

suhanovv commented 3 years ago

@potiuk We deployed it today and are watching; I will come back with the results tomorrow.

potiuk commented 3 years ago

🤞

suhanovv commented 3 years ago

@potiuk The fix does not work:

[screenshot]

Meanwhile we have this graph

[screenshot]

for the old container without the patch, and we cannot yet say why its memory consumption was falling - either it was container restarts or the OS cleared the cache itself. We want to wait for such a situation again to check whether the memory cache filling up is in fact not a problem.

lixiaoyong12 commented 3 years ago

We deployed the scheduler today and its memory increased from 100 MB to 220 MB.

potiuk commented 3 years ago

So I guess the quest continues. Hmm, interesting that it indeed went down after some time. If that's the cache, then it would be strange to see it in container_memory_working_set_bytes (I presume the graph above shows that metric?).

I have another hypothesis. The Linux kernel also has "dentry" and "inode" caches - it keeps the used/opened directory structure and file node information in memory. And I believe those caches would also be cleared whenever the log files are deleted.

If this is a cache, you can very easily check it - you can force cleaning the cache and see the results:

Cleaning just PageCache:

sync; echo 1 > /proc/sys/vm/drop_caches

Cleaning dentries and inodes:

sync; echo 2 > /proc/sys/vm/drop_caches

Could you run such an experiment, please?

potiuk commented 3 years ago

We deployed the scheduler today and its memory increased from 100 MB to 220 MB.

@lixiaoyong12 - what kind of memory are you talking about? Is it container_memory_working_set_bytes or container_memory_cache?