fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

File watchers might not be handled properly causing gradual increase in CPU/Memory usage #4381

Open · uristernik opened 10 months ago

uristernik commented 10 months ago

Describe the bug

The Fluentd tail plugin was outputting `If you keep getting this message, please restart Fluentd`. After coming across https://github.com/fluent/fluentd/issues/3614, we implemented the workaround suggested there.

Since then we no longer see the original `If you keep getting this message, please restart Fluentd`, but we still see lots of `Skip update_watcher because watcher has been already updated by other inotify event`. This is paired with a pattern of memory leaking and a gradual increase in CPU usage until a restart occurs (see the attached screenshot).

To mitigate this, I added `pos_file_compaction_interval 20m` as suggested here, but this had no effect on the resource usage.
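In in_tail terms, the #3614 workaround plus this attempted mitigation come down to the following settings (a fragment excerpted from the full source block under "Your Configuration" below, with comments added):

```
# workaround from fluent/fluentd#3614:
follow_inodes true                  # track files by inode so rotated files are not re-read
rotate_wait 0                       # stop watching a rotated file immediately
# attempted mitigation (no observable effect here):
pos_file_compaction_interval 20m    # periodically drop stale entries from the pos file
```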

[screenshot: resource usage after adding pos_file_compaction_interval]

Related to https://github.com/fluent/fluentd/issues/3614. More specifically https://github.com/fluent/fluentd/issues/3614#issuecomment-1871484810

The suspicion is that some watchers are not handled properly, thus leaking and increasing CPU/memory consumption until the next restart.

To Reproduce

Deploy Fluentd (version v1.16.3-debian-forward-1.0) as a DaemonSet in a dynamic Kubernetes cluster consisting of 50-100 nodes. The Fluentd config is listed under "Your Configuration" below.

Expected behavior

CPU and memory usage should stay stable.

Your Environment

- Fluentd version: [v1.16.3-debian-forward-1.0](https://github.com/fluent/fluentd-kubernetes-daemonset#:~:text=debian%2Dcloudwatch%2D1-,Forward,-docker%20pull%20fluent)

Your Configuration

```
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  follow_inodes true
  rotate_wait 0
  exclude_path ["/var/log/containers/fluentd*.log", "/var/log/containers/*kube-system*.log", "/var/log/containers/*calico-system*.log", "/var/log/containers/prometheus-node-exporter*.log", "/var/log/containers/opentelemetry-agent*.log"]
  pos_file_compaction_interval 20m
  <parse>
    @type multi_format
    <pattern>
      format json
      time_key time
      time_type string
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      keep_time_key true
    </pattern>
    <pattern>
      format /^(?<time>.+?) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.+)$/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </pattern>
  </parse>
  emit_unmatched_lines true
</source>
```
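For context, the two <pattern> blocks above correspond to the two container log formats typically found under /var/log/containers: JSON lines written by the Docker json-file logging driver, and the CRI format written by containerd/CRI-O. Illustrative sample lines (hypothetical content) that each pattern would match:

```
# matched by the json pattern (Docker json-file driver):
{"log":"hello world\n","stream":"stdout","time":"2024-01-15T10:30:00.000000000Z"}

# matched by the regex pattern (CRI format; logtag "F" = full line, "P" = partial):
2024-01-15T10:30:00.000000000+00:00 stdout F hello world
```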

Your Error Log

```shell
Skip update_watcher because watcher has been already updated by other inotify event
```

Additional context

https://github.com/fluent/fluentd/issues/3614

daipom commented 10 months ago

Thanks for your report!

> Fluentd tail plugin was outputting `If you keep getting this message, please restart Fluentd`. After coming across https://github.com/fluent/fluentd/issues/3614, we implemented the workaround suggested there.

- changed `follow_inodes` to `true`
- set `rotate_wait` to `0`

So, `follow_inodes false` has a similar issue. Could you please report the `follow_inodes false` problem in a new issue?

uristernik commented 10 months ago

@daipom In this case I had `follow_inodes true`.

Do you want me to open a new issue just for tracking?

daipom commented 10 months ago

@uristernik Wasn't there a problem with `follow_inodes false` as well? I'd like to sort out the `follow_inodes false` problem and the `follow_inodes true` problem separately.

I'd like to know if there is any difference between `follow_inodes false` and `follow_inodes true`, for example, whether the same resource leakage occurs with `follow_inodes false`.

If there is no particular difference, we are fine with this for now. Thanks!

shadowshot-x commented 1 month ago

We are facing the same issue.

Error message:

```
Skip update_watcher because watcher has been already updated by other inotify event path="/usr/local/logs/app/app.log" inode=20617294 inode_in_pos_file=0
```

We are using:

```
read_from_head true
rotate_wait 30
follow_inodes true
enable_stat_watcher false
```

Memory keeps on gradually growing too! Any resolution on this?
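For completeness, those options would sit in a tail source roughly like this (a sketch, not the actual config; `@id`, `pos_file`, `tag`, and the parse section are hypothetical, while `path` is taken from the error message above):

```
<source>
  @type tail
  @id in_tail_app_logs                    # hypothetical id
  path /usr/local/logs/app/app.log        # from the error message above
  pos_file /var/log/fluentd-app.log.pos   # hypothetical pos file location
  tag app.log                             # hypothetical tag
  read_from_head true
  rotate_wait 30
  follow_inodes true
  enable_stat_watcher false               # disable the inotify stat watcher; fall back to timer polling
  <parse>
    @type none                            # hypothetical; depends on the actual log format
  </parse>
</source>
```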

daipom commented 1 month ago

@shadowshot-x Sorry for my late response. Thanks for your report. Could you please share the Fluentd (td-agent/fluent-package) version and OS?