Hmm... weird. Does this memory leak depend on traffic? I'd also like to know whether this happens with the latest fluentd v1.1.3 or not.
@repeatedly - We are setting up some soak tests for various Fluentd versions (v0.12, v0.14 and v1.1). Will keep you posted once we have some numbers for the latest version. Hopefully it is fixed in the latest version (there seem to be some fixes between those versions).
I am on fluentd-1.0.2 and am experiencing the same issue, i.e. memory usage keeps growing. A restart reclaims the memory.
I am just tailing many files and sending to an aggregator instance.
I couldn't find anything interesting in the dentry cache or in the perf report, but I will try the GC options as suggested here:
https://docs.fluentd.org/v1.0/articles/performance-tuning-single-process#reduce-memory-usage
RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 had no effect:
Above is the memory usage of Ruby/Fluentd after the GC change.
There is a spike just after 3 AM; interestingly, this is when we move older files out of the path tailed by fluentd.
Will setting 'open_on_every_update' in the in_tail plugin help at all?
@cosmok That's interesting. How many files do you watch, and could you paste your configuration so I can reproduce the problem?
@repeatedly At the moment, close to 30,000 files are being watched (distributed across 4 fluentd processes using the multi worker plugin); each file is around 9 KB.
Config (simplified and scrubbed to remove sensitive details)
I think I'm also hitting this issue. I'm using a container based on fluentd-kubernetes-daemonset, tag v1.2-debian-elasticsearch, which has some more plugins on top.
We have one of these containers on each node of a Kubernetes cluster. We were hitting some BufferOverflow issues.
These are the last values for the buffer plugin in the conf (it's 0.x syntax, but I've checked that it gets converted correctly):
...
request_timeout 10s
buffer_type file
buffer_path /var/lib/fluentd/buffer/
buffer_chunk_limit 64m
buffer_queue_limit 20
buffer_queue_full_action drop_oldest_chunk
flush_interval 5s
heartbeat_interval 1
num_threads 2
...
I've played with most of the values; the ones above are the first that gave me stability in some of the pods, combined with setting RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR to 0.9.
The pods leaking memory seem to start doing so after logging messages about buffer overflow:
{"error":"#<Fluent::Plugin::Buffer::BufferOverflowError: buffer space has too many data>","location":"/fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/buffer.rb:269:in `write'","tag":"kubernetes.var.log.containers.weave-scope-agent-fk2kn_kube-monitoring_agent-3a7bad7323da6331ecd1a214fa31ff727bf9071bce22e05054c8e0fc8c325d50.log","message":"emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error=\"buffer space has too many data\" location=\"/fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/buffer.rb:269:in `write'\" tag=\"kubernetes.var.log.containers.weave-scope-agent-fk2kn_kube-monitoring_agent-3a7bad7323da6331ecd1a214fa31ff727bf9071bce22e05054c8e0fc8c325d50.log\"","time":"2018-06-29T11:57:56+00:00"}
{"action":"drop_oldest_chunk","message":"failed to write data into buffer by buffer overflow action=:drop_oldest_chunk","time":"2018-06-29T11:57:56+00:00"}
{"chunk_id":"56fb88ad9bc0721711e7402bb1f53f36","message":"dropping oldest chunk to make space after buffer overflow chunk_id=\"56fb88ad9bc0721711e7402bb1f53f36\"","time":"2018-06-29T11:57:56+00:00"}
2018-06-29 11:57:56 +0000 [warn]: #0 failed to write data into buffer by buffer overflow action=:drop_oldest_chunk
2018-06-29 11:57:56 +0000 [warn]: #0 dropping oldest chunk to make space after buffer overflow chunk_id="56fb88ae1404d5e0ee2c411a5b1d0917"
2018-06-29 11:57:56 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/buffer.rb:269:in `write'" tag="kubernetes.var.log.containers.container-log-ingest-4nhwz_kube-extra_ingest-35047c023d00d45626eadd72378c5562f459e22843a96f3bae65449861f26b46.log"
The above happens for 1674 lines, and after all of that is logged, the memory leak shows up. I'm repeating the tests to see if this is consistent, but memory leaked for almost 50 minutes and then usage started to decrease at the same pace it had been increasing during the leak (about 7 MiB per minute).
Please let me know if there's any more information I could provide. It's been a bit difficult for me to reproduce, as it only happens in our busiest cluster in terms of logging, but I thought the timing of the buffer overflow and the memory leak was worth mentioning.
Here's a picture of the monitoring of one of these containers. The drop in CPU happens when the memory starts to grow, and the logs show that this was the moment the buffer overflow happened. The memory starts to decrease when the network I/O suddenly becomes much lower. The prometheus plugin exposes an endpoint that we use as a liveness probe, and it remained live for the whole monitored period.
@rubencabrera Thanks for the detailed report. I'm now working on this issue again and trying to reproduce the problem with your configuration.
Thanks @repeatedly
The report above was done in the middle of an upgrade of fluentd and all the plugins we use. The buffering problems we had seem solved now, and we don't see the containers getting restarted so often (so I presume the memory issue is under control in this scenario).
I can only get a long-running window in that cluster on weekends, and this weekend I'll be out, but I could try again with the resource limit removed to see what happens. If the leak doesn't occur, maybe I could try to get the buffer error back to see if that's the problem. If you have any other ideas that could help, please let me know.
@repeatedly - We just tried the latest v1.2.3 (comparing with 0.14.25 and 0.12.41). It seems the memory climbing issue is not getting better in the latest version.
Comparison among v1.2.3, v0.14.25 and v0.12.41 over 5 days of growth:
Blue: v1.2.3 (not much better either; it just started later).
Green: v0.14.25
Red: v0.12.41
Comparison between v0.14.25 and v0.12.41 over 19 days of growth:
Red: v0.14.25
Blue: v0.12.41
It almost felt like GC kicked in really late and did not bring the memory allocation down sufficiently. So we also tried setting RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR to 0.9, 1.2, 1.5 and 1.8 for both v0.14.25 and v1.2.3. That did not help slow down the memory climbing either, though.
Any luck with the reproduction?
This is the memory utilization of fluentd (running as a Docker container) over 2 months (there have been ~150 log rows per second in the last month).
@qingling128 Thanks. You use the v0.12 API with v1.x, right?
@tahajahangir Do you use the fluentd v1.x series?
@repeatedly
# fluentd --version
fluentd 1.0.2
# ruby -v
ruby 2.3.3p222 (2016-11-21) [x86_64-linux-gnu]
We are going to test 1.2.2 next week.
I am experiencing the same problem: memory usage keeps growing.
Environment: Amazon Linux 2. Fluentd version: starting fluentd-1.2.2 pid=1 ruby="2.4.4"
When I use jemalloc with Ruby, the memory leak disappears.
#!/bin/bash
export LD_PRELOAD=/usr/lib64/libjemalloc.so.1
export RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9
fluentd -c ./fluent.conf -o ./fluent.log -d ./fluent.pid
Are there some bugs in C extensions?
@repeatedly - We are using https://docs.fluentd.org/v1.0/articles/api-plugin-output. The page says "This page is simply copied from v0.12 documents, and will be updated later." Looking at the content, I guess we should be using the v1.x API under the hood as long as we are using Fluentd v1.x (correct me if I'm wrong).
In fact, I set up some stub gem and was still seeing the memory leak.
The stub gem is just sleeping 0.06 seconds for each chunk and not doing anything else:
require 'time'

module Fluent
  # fluentd output plugin for the Stackdriver Logging API
  class GoogleCloudOutput < BufferedOutput
    Fluent::Plugin.register_output('google_cloud', self)

    def write(chunk)
      sleep(0.06)
    end

    def format(tag, time, record)
      Fluent::Engine.msgpack_factory.packer.write([tag, time, record]).to_s
    end
  end
end
Memory usage
The LD_PRELOAD var is interesting. We do set it in the package, and we install the package inside a container. I'll double check that it's also properly set in the container environment.
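For a quick sanity check (just a sketch, not part of our actual setup), something like this run inside the container can confirm whether libjemalloc is really mapped into the fluentd process:

# check_jemalloc.rb - minimal sketch: pass the fluentd PID as the first
# argument (defaults to this script's own PID) and look for jemalloc in the
# process's memory maps. Linux-only.
pid = ARGV[0] || Process.pid
maps = File.read("/proc/#{pid}/maps")
puts maps.include?('jemalloc') ? 'jemalloc is loaded' : 'jemalloc is NOT loaded - check LD_PRELOAD'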
@qingling128 I don't think that is a memory leak; it looks like a lot of unreleased memory fragments.
And the problem was fixed when I set LD_PRELOAD for libjemalloc or used a new Ruby package compiled with the --with-jemalloc option.
You can refer to https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html.
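One way to tell the two cases apart (a rough sketch, not something from this thread's setup) is to log Ruby's own heap statistics next to the process RSS from inside the fluentd process, for example by calling a probe like the one below from a test plugin: if VmRSS keeps climbing while heap_live_slots stays roughly flat, the growth is happening below the Ruby object heap (allocator fragmentation or a native-extension leak) rather than in leaked Ruby objects.

# Rough sketch of a probe that could be called periodically inside the
# fluentd process (e.g. from a stub output's write method). Linux-only,
# since it reads /proc/self/status.
def log_memory_probe
  rss_kb = File.read('/proc/self/status')[/VmRSS:\s+(\d+)/, 1].to_i
  stat = GC.stat
  warn format('rss=%dMiB live_slots=%d old_objects=%d',
              rss_kb / 1024, stat[:heap_live_slots], stat[:old_objects])
end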
I assume qingling128 uses jemalloc properly. I'm now investigating the problem with qingling128's stub code.
@lee2014 @repeatedly - Yeah, we were using jemalloc. And I have added a step to explicitly enforce that in the container in case the package did not set it up properly. Still getting similar results.
BTW, our configuration uses a combination of the tail, parser, systemd, record_reformer and detect_exceptions plugins. The full configuration can be found below. I'm setting up another experiment to trim that down to just a simple tail plugin (so that we can rule out some noise). Will update with collected data.
If you have time, could you use in_dummy instead of in_tail for testing?
On my Ubuntu 16.04, in_dummy with your stub out_google_cloud doesn't increase RSS with jemalloc 4.5.0.
I want someone to check whether in_tail with lots of files has a problem or not.
<source>
@type dummy
@id test_in
rate 5000 # Use actual message rate
dummy {your actual json here}
tag test
</source>
<match test>
@type google_cloud
@id test_out
buffer_type file
buffer_path /path/to/buffer/leak
buffer_queue_full_action block
buffer_chunk_limit 1M
buffer_queue_limit 6
flush_interval 5s
max_retry_wait 30
disable_retry_limit
num_threads 2
</match>
$ LD_PRELOAD=/path/to/jemalloc/4.5.0/lib/libjemalloc.so RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 RUBYLIB=~/fluentd/lib ~/fluentd/bin/fluentd -c leak.conf -p .
Sure, I'll give in_dummy a try. BTW, I tried the two configuration settings below. One triggered the memory issue while the other did not.
Good one (with only in_tail plugin) good.txt
Bad one (in_tail plugin + multi_format, parser, record_reformer and record_transformer plugins) bad.txt
RSS usage
I just set up a few more experiments with combinations of the multi_format, parser, record_reformer and record_transformer plugins. Hopefully that will narrow down the culprit.
In fact, scratch https://github.com/fluent/fluentd/issues/1941#issuecomment-409652705. The "good" one did not have the log generator properly set up. I just fixed the log generator (shown as 2trim). Based on the trend, the memory issue seems reproducible with just the in_tail plugin and this configuration.
I also set up a dummy experiment with the in_dummy plugin. Will let it soak a bit and update the thread.
I'm also hitting this issue. I'm using 2 instances of fluentd in Kubernetes, and each of them consumes 1 GB of memory.
@repeatedly Seems like the memory issue went away after I replaced in_tail with in_dummy.
<source>
@type dummy
tag dummy_test
rate 1000
dummy {"message":"worldworldworldworldworldworldworldworldworldworldworldworldworldworldworldworldworldworldworldworld"}
</source>
@qingling128 Ah, good.
Could you share the log generator script? I want to debug the in_tail plugin with the same setup.
Sure.
The script inside the log generator container we set up is: log_generator.txt
The Dockerfile is: Dockerfile.txt
We run this container and print to stdout. Kubernetes logs stdout to /var/log/containers/*** and Fluentd tails from there. BTW, we were running python log_generator.py --log-size-in-bytes=100 --log_rate=1000.
If you are not using Kubernetes, you could simply change the print statement to write to a log file instead.
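For reference, a rough Ruby equivalent of that generator (the output path, rate and payload size below are just placeholders), writing JSON lines straight to a file, would be something like:

# log_generator.rb - rough sketch of the Python generator in Ruby: write
# fixed-size JSON log lines to a file at a given rate.
require 'json'
require 'securerandom'

path = ARGV[0] || 'log/data0.log' # placeholder output path (directory must exist)
rate = (ARGV[1] || 1000).to_i     # lines per second
size = (ARGV[2] || 100).to_i      # approximate payload size in bytes

File.open(path, 'a') do |f|
  loop do
    rate.times { f.puts({ 'stream' => 'stdout', 'log' => SecureRandom.hex(size / 2) }.to_json) }
    f.flush
    sleep 1
  end
end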
Hi @repeatedly, any luck reproducing this?
@qingling128 I am running your log generator with the stub code but still no luck in my Ubuntu environment. How many files do you tail? I'm now testing with 9 files (9000 msg/sec).
We are tailing 30 files, each with 30 msg/sec.
By the way, I just started some simple setup in order to provide some reproduction steps without getting deeply involved in our infrastructure: https://github.com/qingling128/fluent-plugin-buffer-output-stub/tree/master.
I have only just set up an experiment. I'll let it run overnight to see if it reproduces the issue. The configuration and setup are similar to, but not exactly the same as, what we have in Kubernetes. I will iterate on this to try to reproduce it.
We are tailing 30 files, each with 30 msg/sec.
Thanks. I will try it with a similar setting.
Does anyone reproduce this problem in an environment without Docker/k8s? I have run tests on my Mac and on Ubuntu for several days but could not reproduce the problem, so I want to know whether this problem happens only in Docker environments or not.
Here is my environment:
require 'parallel' # 'parallel' gem
nums = (0...30).to_a
Parallel.map(nums, in_threads: nums.size) { |n|
  `python log_generator.py --log-size-in-bytes=100 --log-rate=100 >> log/data#{n}.log`
}
sleep
<source>
@type tail
path ./log/*.log
pos_file ./pos
format json
read_from_head true
tag test
</source>
<match test>
@type test # This plugin is same with https://github.com/qingling128/fluent-plugin-buffer-output-stub/tree/master
@id test_out
buffer_type file
buffer_path /path/to/buffer/test
buffer_queue_full_action block
buffer_chunk_limit 1M
buffer_queue_limit 6
flush_interval 5s
max_retry_wait 30
disable_retry_limit
num_threads 2
</match>
This is based on the above. I just changed the self._log_record line to self._log_record = '{{"stream":"stdout","log":"{}"}}'.format(self._random_string(log_size_in_bytes)) for the json format in in_tail.
fluentd: 1.2.4
ruby: 2.4.4
command: RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 RUBYLIB=~/dev/fluentd/fluentd/lib ~/dev/fluentd/fluentd/bin/fluentd -c f.conf -p plugin
I also tried to reproduce it in a non-docker & non-k8s environment with similar configurations. No luck yet.
Hi @repeatedly @qingling128 ,
I am using neither docker nor k8s. My config is as per my comment here: https://github.com/fluent/fluentd/issues/1941#issuecomment-382974082
Just a thought: could log rotation contribute to the issue? As I thought about the difference between the two setups (k8s vs. no k8s), this was the first thing that crossed my mind.
Current GKE log rotation happens when a log file exceeds 10 MB. At a load of 100 KB/s, the log file is rotated roughly every 10 * 1024 / 100 ≈ 102 seconds.
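To mimic that rotation outside GKE (just a sketch; the glob, the 10 MB threshold and the rename scheme are placeholders), a simple rotation loop could be run next to the log generator:

# rotate.rb - rough sketch: once a tailed file exceeds a size threshold,
# move it aside and recreate the path, so in_tail sees a rotation.
require 'fileutils'

LIMIT = 10 * 1024 * 1024 # ~10 MB, roughly the rotation threshold described above

loop do
  Dir.glob('log/*.log').each do |path|
    next unless File.size?(path).to_i > LIMIT
    FileUtils.mv(path, "#{path}.#{Time.now.to_i}") # move the full file aside
    FileUtils.touch(path)                          # recreate the tailed path
  end
  sleep 5
end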
I see. I will run a rotation script alongside par.rb and observe memory usage.
I noticed that the timers for rotated files are not released. Will fix and observe memory usage.
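Schematically, that class of leak looks like the toy registry below (an illustration only, not fluentd's actual in_tail code): a timer is started for each tailed path, but nothing releases it when the path is rotated away, so one timer and its closure accumulate per rotation.

# Toy illustration of the leak pattern, not fluentd's implementation.
class WatcherRegistry
  def initialize
    @timers = {}
  end

  def watch(path)
    # One timer thread per path, standing in for a periodic refresh timer.
    @timers[path] ||= Thread.new { loop { sleep 1 } }
  end

  def unwatch(path)
    # The fix amounts to releasing the timer together with the watcher.
    timer = @timers.delete(path)
    timer && timer.kill
  end
end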
Sounds promising. Keep us posted. :D Thanks a lot!
The patch is here: https://github.com/fluent/fluentd/pull/2105
Great! Let me know if there is anything I could help test. :)
I released v1.2.5.rc1 for testing.
You can install this version with the --pre option of gem install.
Great. I've set up some test for that version. Will keep you posted.
Seems like that patch fixed the issue we had. Thank you so much @repeatedly !
Gem versions we tested with:
$ kubectl -n stackdriver-agents exec stackdriver-logging-agent-s5sqr ls /opt/google-fluentd/embedded/lib/ruby/gems/2.4.0/gems | grep fluent
fluent-logger-0.7.2
fluent-mixin-config-placeholders-0.4.0
fluent-mixin-plaintextformatter-0.2.6
fluent-plugin-detect-exceptions-0.0.10
fluent-plugin-google-cloud-0.6.23
fluent-plugin-mongo-0.7.13
fluent-plugin-multi-format-parser-0.1.1
fluent-plugin-prometheus-0.3.0
fluent-plugin-record-reformer-0.9.1
fluent-plugin-rewrite-tag-filter-1.5.5
fluent-plugin-s3-0.8.4
fluent-plugin-scribe-0.10.14
fluent-plugin-systemd-0.3.0
fluent-plugin-td-0.10.28
fluent-plugin-td-monitoring-0.2.2
fluent-plugin-webhdfs-0.4.2
fluentd-1.2.5.rc1
BTW, when are we expecting a formal release of fluentd-1.2.5?
Released v1.2.5. Thanks for the testing.
Thank you!
Fluentd version: 0.14.25 Environment: running inside a debian:stretch-20180312 based container. Dockerfile: here
We noticed a slow memory leak that built up over a month or so.
The same setup running with Fluentd 0.12.41 had stable memory usage over the same period of time.
Still investigating and trying to narrow down versions, but I want to create a ticket to track this.
Config: