Mosibi opened this issue 3 years ago
It seems related to #3342.
FYI: Recently, fluent-plugin-systemd 1.0.5 was released.
It fixes "Plug a memory leaks on every reload" https://github.com/fluent-plugin-systemd/fluent-plugin-systemd/pull/91
Is it still reproducible with fluent-plugin-systemd 1.0.5?
I tested with the complete config and I still see the memory leak. Today I will run with only the systemd part to see if there is a change.
I've been investigating a similar issue for a few days, and observed that on fluentd 1.11.5 a null output with a file buffer also constantly uses more memory, in contrast to a null output without a buffer section. I can rerun the tests on a newer fluentd.
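Roughly, the two outputs being compared look like this (the buffer path is a placeholder, not the exact test config):

<match **>
  @type null
</match>

versus

<match **>
  @type null
  <buffer>
    @type file
    path /fluentd/buffer/null    # placeholder path
  </buffer>
</match>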
Ah, that is good information. Please share the results with the latest fluentd version, if you can.
Is it possible to share the configuration you are using?
This is my actual config
2021-06-01 12:30:47 +0000 [info]: using configuration file: <ROOT>
<match fluentd.pod.healthcheck>
@type relabel
@label @FLUENT_LOG
</match>
<label @FLUENT_LOG>
<match **>
@type null
</match>
</label>
<system>
log_level info
</system>
<source>
@type forward
port 24321
bind "0.0.0.0"
</source>
<match containers.**>
@type relabel
@label @NORMAL
</match>
<label @NORMAL>
<match containers.**>
@type copy
<store>
@type "null"
<buffer>
@type "file"
path "/fluentd/buffer/logs.containers"
compress gzip
flush_interval 5s
flush_thread_count 8
chunk_limit_size 1m
total_limit_size 128m
queued_chunks_limit_size 128
overflow_action drop_oldest_chunk
retry_max_interval 10m
retry_forever true
</buffer>
</store>
</match>
</label>
</ROOT>
It seems related to #3342.
FYI: Recently, fluent-plugin-systemd 1.0.5 was released.
It fixes "Plug a memory leaks on every reload" fluent-plugin-systemd/fluent-plugin-systemd#91
Is it still reproducible with fluent-plugin-systemd 1.0.5?
Yes, also with fluent-plugin-systemd version 1.0.5 I see the same behaviour. I can attach a new graph, but it shows exactly the same as the first one.
Thanks to @sumo-drosiek, I can confirm that no memory leak occurs when I do not use a buffer section in the output. I am rerunning everything using a new config with a buffer section to be 100% sure.
The config without buffers
<system>
log_level info
log_event_verbose true
root_dir /tmp/fluentd-buffers/
</system>
<label @OUTPUT>
<match **>
@type null
</match>
</label>
###
# systemd/journald
###
<source>
@type systemd
tag systemd
path /var/log/journal
read_from_head true
<entry>
fields_strip_underscores true
fields_lowercase true
</entry>
@label @OUTPUT
</source>
I've confirmed the issue. I think it's not a fluentd issue but a fluent-plugin-systemd issue. Please forward the report to https://github.com/fluent-plugin-systemd/fluent-plugin-systemd
Details:
The issue doesn't reproduce when I replace the <source> with other plugins like in_tail (with tailing a huge file).
The cause seems to be read_from_head of in_systemd. When it's specified, in_systemd tries to read whole entries by running a busy loop: https://github.com/fluent-plugin-systemd/fluent-plugin-systemd/blob/9ffe8e14de75ca94e9bfe7428efd5c1a59421511/lib/fluent/plugin/in_systemd.rb#L144
diff --git a/lib/fluent/plugin/in_systemd.rb b/lib/fluent/plugin/in_systemd.rb
index adb8b3f..66f4373 100644
--- a/lib/fluent/plugin/in_systemd.rb
+++ b/lib/fluent/plugin/in_systemd.rb
@@ -120,6 +120,7 @@ module Fluent
init_journal if @journal.wait(0) == :invalidate
watch do |entry|
emit(entry)
@ashie I just wanted to add the results of my test run with buffering configured in an output: without buffering there is no memory leak, and with buffering a memory leak occurs.
So to be complete:
No memory leak:
<label @OUTPUT>
<match **>
@type null
</match>
</label>
Memory leak:
<label @OUTPUT>
<match **>
@type null
<buffer>
@type "file"
path "/var/log/fluentd-buffers/output.buffer"
compress gzip
flush_interval 5s
flush_thread_count 8
chunk_limit_size 1m
total_limit_size 128m
queued_chunks_limit_size 128
overflow_action drop_oldest_chunk
retry_max_interval 10m
retry_forever true
</buffer>
</match>
</label>
Thanks, further investigation may be needed on the fluentd side. Reopening.
The following graph displays memory usage over almost 7 days. This is my regular config with only the buffer sections in the Elasticsearch outputs disabled. I was already sure that it was the buffer section; this is just more proof :)
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days
@ashie any update on this?
I am also facing this issue. Support has suggested setting up a cron job to restart the service until a resolution comes: https://stackoverflow.com/questions/69161295/gcp-monitoring-agent-increasing-memory-usage-continuously.
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days
The issue is still here. Please unstale it.
Could it be connected to #3634?
Any update for this?
I faced this issue with a file buffer in the output config. For me, the fix was setting
prefer_oj_serializer true
http_backend typhoeus
buffer_type memory
in the elasticsearch match section.
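Roughly, the match section ends up like this (the tag, host, and port are placeholders, not my exact config):

<match containers.**>
  @type elasticsearch
  host elasticsearch.example.com   # placeholder host
  port 9200
  prefer_oj_serializer true        # use the Oj JSON serializer
  http_backend typhoeus            # switch the HTTP client backend
  <buffer>
    @type memory                   # memory buffer (equivalent to buffer_type memory)
  </buffer>
</match>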
And the aggregator pod keeps using all the RAM you give it, right up to the limit, until the RAM is released. I also added:
extraEnvVars:
# https://brandonhilkert.com/blog/reducing-sidekiq-memory-usage-with-jemalloc/?utm_source=reddit&utm_medium=social&utm_campaign=jemalloc
- name: LD_PRELOAD
value: /usr/lib/x86_64-linux-gnu/libjemalloc.so
# https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
- name: MALLOC_ARENA_MAX
value: "2"
This helped to make things usable, but I still have no idea why it is so greedy with RAM.
(By the way, the forwarders are behaving normally.)
Of course, I had to manually set the Elasticsearch version to 7.17.
I wasted tons of time on this, and I'll be happy if any of these hints are useful.
# https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html
- name: MALLOC_ARENA_MAX
  value: "2"

When you use jemalloc, MALLOC_ARENA_MAX doesn't make sense, so you can remove it.
I face a memory leak as well. My inputs are:
forward
http (for the liveness and readiness probes)
systemd
My outputs are:
null (for fluentd's own logs)
opensearch (with a file buffer)
stdout (for health checks)
The file buffer is always flushed, but memory keeps increasing.
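The buffered opensearch output looks roughly like this (the tag, host, and buffer path are placeholders, not my exact config):

<match containers.**>
  @type opensearch
  host opensearch.example.com         # placeholder host
  port 9200
  <buffer>
    @type file
    path /fluentd/buffer/opensearch   # placeholder path
    flush_interval 5s
  </buffer>
</match>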
Version:
activesupport (7.0.2.3)
addressable (2.8.0)
aws-eventstream (1.2.0)
aws-partitions (1.579.0)
aws-sdk-core (3.130.2)
aws-sdk-kms (1.56.0)
aws-sdk-s3 (1.113.1)
aws-sdk-sqs (1.51.0)
aws-sigv4 (1.5.0)
benchmark (default: 0.1.0)
bigdecimal (default: 2.0.0)
bundler (2.3.12, 2.3.11)
cgi (default: 0.1.0.1)
concurrent-ruby (1.1.10)
cool.io (1.7.1)
csv (default: 3.1.2)
date (default: 3.0.3)
delegate (default: 0.1.0)
did_you_mean (default: 1.4.0)
digest-crc (0.6.4)
domain_name (0.5.20190701)
elastic-transport (8.0.0)
elasticsearch (8.1.2)
elasticsearch-api (8.1.2)
elasticsearch-xpack (7.17.1)
etc (default: 1.1.0)
excon (0.92.2)
faraday (1.10.0)
faraday-em_http (1.0.0)
faraday-em_synchrony (1.0.0)
faraday-excon (1.1.0)
faraday-httpclient (1.0.1)
faraday-multipart (1.0.3)
faraday-net_http (1.0.1)
faraday-net_http_persistent (1.2.0)
faraday-patron (1.0.0)
faraday-rack (1.0.0)
faraday-retry (1.0.3)
faraday_middleware-aws-sigv4 (0.6.1)
fcntl (default: 1.0.0)
ffi (1.15.5)
ffi-compiler (1.0.1)
fiddle (default: 1.0.0)
fileutils (default: 1.4.1)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.5.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.14)
fluent-plugin-elasticsearch (5.2.2)
fluent-plugin-grafana-loki (1.2.18)
fluent-plugin-kafka (0.17.5)
fluent-plugin-kubernetes_metadata_filter (2.10.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-opensearch (1.0.2)
fluent-plugin-prometheus (2.0.2)
fluent-plugin-record-modifier (2.1.0)
fluent-plugin-rewrite-tag-filter (2.4.0)
fluent-plugin-s3 (1.6.1)
fluent-plugin-systemd (1.0.5)
fluentd (1.14.6)
forwardable (default: 1.3.1)
getoptlong (default: 0.1.0)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.4)
http-form_data (2.3.0)
http-parser (1.2.3)
http_parser.rb (0.8.0)
i18n (1.10.0)
io-console (default: 0.5.6)
ipaddr (default: 1.2.2)
irb (default: 1.2.6)
jmespath (1.6.1)
json (default: 2.3.0, 2.1.0)
jsonpath (1.1.2)
kubeclient (4.9.3)
logger (default: 1.4.2)
lru_redux (1.1.0)
ltsv (0.1.2)
matrix (default: 0.2.0)
mime-types (3.4.1)
mime-types-data (3.2022.0105)
minitest (5.15.0, 5.13.0)
msgpack (1.5.1)
multi_json (1.15.0)
multipart-post (2.1.1)
mutex_m (default: 0.1.0)
net-pop (default: 0.1.0)
net-smtp (default: 0.1.0)
net-telnet (0.2.0)
netrc (0.11.0)
observer (default: 0.1.0)
oj (3.3.10)
open3 (default: 0.1.0)
opensearch-api (1.0.0)
opensearch-ruby (1.0.0)
opensearch-transport (1.0.0)
openssl (default: 2.1.3)
ostruct (default: 0.2.0)
power_assert (1.1.7)
prime (default: 0.1.1)
prometheus-client (4.0.0)
pstore (default: 0.1.0)
psych (default: 3.1.0)
public_suffix (4.0.7)
racc (default: 1.4.16)
rake (13.0.6, 13.0.1)
rdoc (default: 6.2.1.1)
readline (default: 0.0.2)
readline-ext (default: 0.1.0)
recursive-open-struct (1.1.3)
reline (default: 0.1.5)
rest-client (2.1.0)
rexml (default: 3.2.3.1)
rss (default: 0.2.8)
ruby-kafka (1.4.0)
ruby2_keywords (0.0.5)
rubygems-update (3.3.11)
sdbm (default: 1.0.0)
serverengine (2.2.5)
sigdump (0.2.4)
singleton (default: 0.1.0)
stringio (default: 0.1.0)
strptime (0.2.5)
strscan (default: 1.0.3)
systemd-journal (1.4.2)
test-unit (3.3.4)
timeout (default: 0.1.0)
tracer (default: 0.1.0)
tzinfo (2.0.4)
tzinfo-data (1.2022.1)
unf (0.1.4)
unf_ext (0.0.8.1)
uri (default: 0.10.0)
webrick (1.7.0, default: 1.6.1)
xmlrpc (0.3.0)
yajl-ruby (1.4.2)
yaml (default: 0.1.0)
zlib (default: 1.1.0)
I am also seeing a memory leak. Here is our Gemfile.lock and here is a 30 day graph showing the memory leak over several daemonset restarts. Each distinct color is one instance of fluentd servicing a single node:
After upgrading fluentd from v1.7.4 to v1.14.6, memory usage spiked from 400~500Mi to about 4Gi, and it keeps growing slowly but steadily day by day, even though there were NO configuration or workload changes. The following graph covers several hours.
Using plugin list:
output
input
Are there any changes between the two versions that could affect memory usage?
@Mosibi
I have the same issue. I don't know if you have checked this or not, but it seems to be a Kubernetes reporting issue: it is reporting wrong metrics for memory. If you shell into the pod and run free, I think you will see that the free memory is more than what the Kubernetes metrics report.
The Kubernetes metrics report memory usage including cache as well; maybe that's the issue.
@madanrishi do you have a link to the kubernetes issue that you think this might be related to?
Any update on this issue, guys? I am facing a similar issue.
@madanrishi do you have a link to the kubernetes issue that you think this might be related to? @danielhoherd My bad on this; on further investigation, this is not the case. I still have the same issue. A statefulset restart fixes it, but that's not an ideal resolution.
Hi,
We have a memory leak in the following situation:
Changing the log_level so that non-matching-pattern messages are not logged fixes the memory leak (e.g. changing log_level to error).
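For example, a minimal way to raise the log level (a sketch; the actual config has more in it):

<system>
  # error suppresses the warn-level "pattern not matched" messages
  log_level error
</system>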
Environment:
Facing the same issue. Observed with the cloudwatch_logs, log_analytics, logzio-0.0.22, and slack plugins: the memory consumption of each process keeps rising from a hundred megabytes to several gigabytes within a few days.
td-agent 4.4.1 fluentd 1.13.3 (https://github.com/fluent/fluentd/commit/c32842297ed2c306f1b841a8f6e55bdd0f1cb27f)
My colleague Lester and I found an issue related to this memory problem: https://github.com/fluent/fluentd/issues/4174
Describe the bug
I noticed a memory issue, which looks to me like a leak. I had a more complex configuration, but boiled it down to this simple config to rule out the specifics of my setup. Even with this config (see below), I observe the memory leak.
I run fluentd 1.12.3 from a container on a Kubernetes cluster via Helm
To Reproduce
Deploy Fluentd 1.12.3 using Helm on a Kubernetes cluster with the provided config. Let it run for some time and observe the memory metrics.
Expected behavior
No memory leak.
Your Environment
Fluentd 1.12.3 on Kubernetes 1.17.7
Your Configuration
Additional context
Using my more complex configuration I observe the same memory growth, but there seems to be a correlation between the growth in size and the number of logs ingested. In other words, with more input (in my case, container logs), the (assumed) memory leak is larger.