@kattz-kawa Thanks so much! I didn't notice it! Sorry for my lack of progress. I want to reach a conclusion this week.
I think the following is one of the causes.
In `update_watcher`, we don't update a watcher in the case of `new_position_entry.read_inode != 0`.
However, in this case, `detach_watcher_after_rotate_wait` is still called.
This seems wrong. We must not detach a watcher without removing it. (It is only allowed on closing.)
We need to fix it as follows.
```diff
  rotated_tw.unwatched = true if rotated_tw
  @tails[path] = setup_watcher(target_info, new_position_entry)
  @tails[path].on_notify
+ detach_watcher_after_rotate_wait(rotated_tw, pe.read_inode) if rotated_tw
end
else
  @tails[path] = setup_watcher(target_info, pe)
  @tails[path].on_notify
+ detach_watcher_after_rotate_wait(rotated_tw, pe.read_inode) if rotated_tw
end
- detach_watcher_after_rotate_wait(rotated_tw, pe.read_inode) if rotated_tw
```
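For context, here is a hedged reconstruction of the surrounding conditional structure with the move applied. The outer `if @follow_inodes` / inner `if new_position_entry.read_inode == 0` nesting is inferred from the bare `end` / `else` lines in the diff and from this discussion, not quoted from the fluentd source:

```ruby
# Hedged reconstruction, not the verbatim fluentd source: the outer nesting is
# inferred from the diff's bare `end` / `else` lines and the discussion above.
if @follow_inodes
  if new_position_entry.read_inode == 0
    rotated_tw.unwatched = true if rotated_tw
    @tails[path] = setup_watcher(target_info, new_position_entry)
    @tails[path].on_notify
    # moved here: detach only when the watcher was actually replaced
    detach_watcher_after_rotate_wait(rotated_tw, pe.read_inode) if rotated_tw
  end
else
  @tails[path] = setup_watcher(target_info, pe)
  @tails[path].on_notify
  # moved here: same for the follow_inodes false path
  detach_watcher_after_rotate_wait(rotated_tw, pe.read_inode) if rotated_tw
end
# removed: the unconditional call that also detached when no update happened
```

Written this way, the detach only happens on the paths that actually replaced the watcher, which matches the rule above that a watcher must not be detached without being removed.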
I don't understand what the condition `new_position_entry.read_inode == 0` means.
Certainly, this condition protects against the case where `refresh_watchers` handles the rotation first.
However, I'm not sure if the condition was intended for that.
What is certain is that `new_position_entry.read_inode` being non-zero here means that the inode after rotation is NOT a new inode, and we must not detach the watcher in this case.
I'm creating some test code for this issue. Based on the tests, I'd like to figure out how to adapt the modifications that have come up so far.
https://github.com/fluent/fluentd/issues/3614#issuecomment-1598505831
I confirmed this behavior with the #4207 tests. This solves the problem of stopping pushing logs, but handle leaks still occur (the same as the current impl).
Following this direction, I created a PR:
@daipom san,
Thank you for the reply!
> I think the following is one of the causes.
> In `update_watcher`, we don't update a watcher in the case of `new_position_entry.read_inode != 0`. However, in this case, `detach_watcher_after_rotate_wait` is still called. This seems wrong. We must not detach a watcher without removing it. (It is only allowed on closing.)
> We need to fix it as follows. [diff quoted above]
> [...] What is certain is that `new_position_entry.read_inode` being non-zero here means that the inode after rotation is NOT a new inode, and we must not detach the watcher in this case.
We also had the same concern but hesitated to change this part for the following reason:
`detach_watcher_after_rotate_wait()` wasn't moved under the `if new_position_entry.read_inode == 0` block in this commit:
https://github.com/fluent/fluentd/commit/b83f73e5c8830a8e56bc693ad8a6a3e316a9dcfd#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R477
I agree with you that `detach_watcher_after_rotate_wait()` should be moved as you commented.
in_tail: Ensure to detach correct watcher on rotation with follow_inodes #4208
We will check #4185. (Thank you for adding me and @kattz-kawa as Co-author ^_^)
@masaki-hatada

> https://github.com/fluent/fluentd/pull/4208 is the replacement of https://github.com/fluent/fluentd/pull/4185, isn't it? We will check https://github.com/fluent/fluentd/pull/4185.

I would appreciate it if you could check #4208 too. I want to merge one of #4208 and #4185 (I will review #4191 later). Both would improve this problem. I think #4208 would be a more direct solution, but I want to hear opinions.

> (Thank you for adding me and @kattz-kawa as Co-author ^_^)

I could create #4208 thanks to #4185 and #4191! Until these PRs were created, I had no idea what was causing this problem. Thank you all for your contributions!

> We will check https://github.com/fluent/fluentd/pull/4185. I would appreciate it if you could check https://github.com/fluent/fluentd/pull/4208 too.

Sorry, "We will check https://github.com/fluent/fluentd/pull/4185" was a typo ^^; I will check #4208!
Not yet confirmed for the `follow_inodes false` case. Reopen.
The mechanism of #4190 doesn't depend on `follow_inodes`, so it definitely affects the `follow_inodes false` case too, and #4208 should fix it. I believe #4190 is the root cause of this issue.
I'll close this issue after we check it.
Hi @ashie, have you had a chance to confirm this fix?
Sorry, I have not been able to confirm the `follow_inodes false` case yet. I will in the near future.
I added a test to reproduce the `follow_inodes false` problem.
It is not fixed yet. We need to fix it.
I understand the mechanism. I will explain it in detail later.
Any update on the case with `follow_inodes false`? @daipom @ashie We are still getting the "Skip update_watcher ..." log.
Fluentd image: fluent/fluentd:v1.16.2-1.0
Configuration:
```
path "/var/log/service.log"
path_key log_path
refresh_interval 20
pos_file "/var/log/fluentd.pos"
read_from_head true
enable_stat_watcher false
pos_file_compaction_interval 60s
rotate_wait 1s
```
@tmavrakis Thanks for your report.
I know `follow_inodes false` still has a problem.
Sorry, I can't make time for this problem right now. I will handle this in September.
Could you share the content of the pos file when the problem occurs?
Thanks for the update @daipom.
The pos file seems to be OK:
```
/var/log/service.log 000000000097d9f3 00000000002f710b
```
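For reference, each pos file entry is (as far as I know) the tab-separated path, byte offset, and inode, the latter two as 16 hex digits. Here is a tiny throwaway Ruby helper of my own (`decode_pos_line` is not a fluentd API) to decode such a line; an entry that has been unwatched would instead carry the offset `ffffffffffffffff`:

```ruby
# Throwaway helper (my own, not fluentd API): decode one in_tail pos-file line.
# Assumed format: "<path>\t<offset hex>\t<inode hex>"; an unwatched entry is
# marked with the offset ffffffffffffffff.
UNWATCHED_OFFSET = 0xffffffffffffffff

def decode_pos_line(line)
  path, offset_hex, inode_hex = line.chomp.split("\t")
  offset = offset_hex.to_i(16)
  { path: path, offset: offset, inode: inode_hex.to_i(16), unwatched: offset == UNWATCHED_OFFSET }
end

p decode_pos_line("/var/log/service.log\t000000000097d9f3\t00000000002f710b")
# offset 0x97d9f3 = 9951731 bytes, inode 0x2f710b = 3109131, not unwatched
```

Decoded this way, the entry above points at offset 9951731 of inode 3109131, which indeed looks normal.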
> I added a test to reproduce the `follow_inodes false` problem.
>
> - [in_tail: add test updating TailWatcher without follow_inodes #4264](https://github.com/fluent/fluentd/pull/4264)
>
> It is not fixed yet. We need to fix it.
> I understand the mechanism. I will explain it in detail later.
Sorry for being late.
I summarize this `follow_inodes false` problem below.

When `follow_inodes false`, `in_tail` cannot tail a new current log file if certain conditions are met.
I think these are very specific conditions.
I assume that some environments are likely to have this problem and others not so much.

The main cause is the rotate_wait feature.
`refresh_watcher()` can wrongly unwatch a TailWatcher that is still working, and then it can stop that TailWatcher's updates.

Here is the mechanism (see the timing sketch after the workaround list below).

1. `refresh_watcher()` calls `stop_watcher()` when a rotation occurs.
   - `refresh_watcher()` is called when the current log file does not exist. (The previous current file is renamed, but the new file is not created yet.)
2. `update_watcher()` for the rotation occurs after the `refresh_watcher()` and within `rotate_wait`.
   - (Normally, the inotify watcher triggers `update_watcher()` immediately, before `refresh_watcher()`.)
   - (This can happen when the `enable_stat_watcher` feature is slow for some reason or `enable_stat_watcher` is set to `false`.)
3. After the `rotate_wait` interval, the position is unwatched, but that position is still used in the updated TailWatcher.
   - (The entry is deleted from the run-time data `PositionFile::map`.)
   - (The unwatch marker `FFFF...` is overwritten by the new position.)
4. As a result, `update_watcher` is wrongly skipped because the PositionEntry is already deleted from the run-time data (`PositionFile::map`).
   - The log `Skip update_watcher because watcher has been already updated by other inotify event` occurs. (Fluentd v1.15.1 and later)

The symptoms of this problem:

- `Skip update_watcher because watcher has been already updated by other inotify event`
- `... already exists. use latest one: deleted #<...>` (When restarting Fluentd)

Workaround:

- `follow_inodes true` (Fluentd v1.16.2 (fluent-package v5.0.0) or later)
- `rotate_wait 0`
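To make the timing above concrete, here is a rough reproduction sketch of my own (paths, intervals, and sleep values are assumptions, and it is timing-sensitive). It only drives the filesystem, while a separately running fluentd with `follow_inodes false`, a non-zero `rotate_wait`, and `enable_stat_watcher false` tails the path:

```ruby
# Rough, timing-sensitive sketch (assumed paths/intervals) of the rotation pattern
# described above; it does not touch fluentd itself, it only reproduces the
# filesystem events. Defaults assumed: refresh_interval 60s, rotate_wait 5s.
require "fileutils"

log = "/tmp/in_tail_repro/current.log"   # path tailed by a running in_tail (assumption)
FileUtils.mkdir_p(File.dirname(log))
File.write(log, "before rotation\n")
sleep 5                                   # give in_tail time to start tailing

# Step 1: rotate by renaming; the "current" path now does not exist, so the next
# refresh_watchers run stops/unwatches the still-working TailWatcher.
File.rename(log, "#{log}.1")
sleep 65                                  # wait past refresh_interval while the path is missing

# Step 2: create the new current file only now, so update_watcher for the rotation
# happens after refresh_watchers and (ideally) within rotate_wait of the stop.
File.write(log, "after rotation\n")

# Step 3: keep appending; once the position entry has been unwatched while still
# in use, later lines stop being collected and the "Skip update_watcher ..."
# warning shows up in the fluentd log.
5.times do |i|
  File.open(log, "a") { |f| f.puts "line #{i}" }
  sleep 1
end
```

Whether the race is actually hit depends on where the refresh and rotate-wait timers land, so the #4264 test above is the reliable way to reproduce it.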
> Thanks for the update @daipom. The pos file seems to be OK:
> `/var/log/service.log 000000000097d9f3 00000000002f710b`
Sorry for being late. I currently think the above is the cause of this problem.
I'm not sure why your pos file shows nothing wrong.
After restarting Fluentd, the pos file will recover, with the log `... already exists. use latest one: deleted #<...>`.
I'm guessing that you restarted Fluentd after this problem occurred and the file was recovered.
If there seems to be a different phenomenon going on than I assumed, please let me know.
My suggestion to fix this would be to add a native component to fluentd for handling in_tail properly. Using the in_tail code from fluent-bit to build a native component would probably not be much work to get working properly.
It could be built in a separate project, which would allow the fluentd build to just download the binaries as a dependency.
From what I can gather, from struggling with this issue for a while now, this would be the safest solution to this problem.
I've also made a small tool in Golang for easily converting the pos files into the SQLite db used by fluent-bit.
I'll look into sharing this tool here as soon as I have the opportunity to do so.
Here's the pos file migration script I wrote. (It seems I forgot that I actually wrote it in bash, not Golang.)
It requires `sqlite3` to be installed for the script to work. Installation guides can be found here: https://www.tutorialspoint.com/sqlite/sqlite_installation.htm
```bash
#!/usr/bin/env bash

info() {
  printf "[%s] INFO - %s\n" "$(date --iso-8601=seconds)" "$@"
}

readonly DB='/opt/fluent-bit-db/log-tracking.db'
readonly FLUENTD_LOG_POS="/var/log/fluentd-containers.log.pos"

if [[ ! -f "$FLUENTD_LOG_POS" ]]; then
  info "No FluentD log tracking file to migrate from"
  exit
fi

if [[ ! -f "$DB" ]]; then
  sqlite3 "$DB" "CREATE TABLE main.in_tail_files (id INTEGER PRIMARY KEY, name TEXT, offset INTEGER, inode INTEGER, created INTEGER, rotated INTEGER);"
else
  info "fluent-bit database already exists, will not do migration"
  exit
fi

# Each pos-file line is "<path>\t<offset hex>\t<inode hex>"; convert the hex
# fields to decimal and insert them into the fluent-bit tracking table.
while read -r line; do
  IFS=$'\t' read -r -a parts <<< "$line"
  filename="${parts[0]}"
  offset="$((16#${parts[1]}))"
  inode="$((16#${parts[2]}))"
  now="$(date +%s)"
  sqlite3 "$DB" "INSERT INTO in_tail_files (name, offset, inode, created, rotated) VALUES ('$filename', $offset, $inode, $now, 0)"
done < <(sort "$FLUENTD_LOG_POS")
```
There is no security handling like escaping values for the `INSERT INTO` or anything, but its intended usage is in controlled environments. I am guessing there are plenty of things to improve on, but this has worked well on our internal k8s testing nodes, where we've used it as an init container for the fluent-bit pod which takes over for Fluentd.
Thank you all very much! Thanks to all your help, we were able to identify the logic problems and fix them!
The problem of `in_tail` stopping collection is fixed in Fluentd v1.16.3, Fluent Package v5.0.2, and td-agent v4.5.2.
Please use them!
If you still have problems, we'd be very grateful if you report them to us.
No, it is not fixed. We've switched to image 1.16.3-debian-elasticsearch8-amd64-1.0, but still see almost daily outages.
We are still seeing this issue as well. We are using this image v1.16.3-debian-forward-1.0.
We were seeing the following message, `If you keep getting this message, please restart Fluentd`, and as suggested here we changed `follow_inodes` to `true` and set `rotate_wait` to `0`, but we are still seeing loads of `Skip update_watcher because watcher has been already updated by other inotify event`.
We also noticed a pattern of memory leaking and a gradual increase in CPU usage until a restart occurs. We are using fluentd as a daemonset on a kubernetes cluster. Here is our `in_tail` config:
```
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  follow_inodes true
  rotate_wait 0
  exclude_path ["/var/log/containers/fluentd*.log"]
  <parse>
    @type multi_format
    <pattern>
      format json
      time_key time
      time_type string
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      keep_time_key true
    </pattern>
    <pattern>
      format /^(?<time>.+?) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.+)$/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </pattern>
  </parse>
  emit_unmatched_lines true
</source>

<filter kubernetes.**>
  @type concat
  key log
  partial_key logtag
  partial_value P
  separator ""
</filter>
```
> We are still seeing this issue as well. We are using this image v1.16.3-debian-forward-1.0.
> We were seeing the following message, `If you keep getting this message, please restart Fluentd`, and as suggested here we changed `follow_inodes` to `true` and set `rotate_wait` to `0`, but we are still seeing loads of `Skip update_watcher because watcher has been already updated by other inotify event`.
> We also noticed a pattern of memory leaking and gradual increase in CPU usage until a restart occurs. We are using fluentd as a daemonset on a kubernetes cluster. (`in_tail` config quoted above.)
In an environment where we have high volatility (we constantly deploy new code -> deployments are restarted very frequently -> pods are created -> lots of files to tail), we see a clear leak pattern. As mentioned above, this is coupled with a huge amount of:
`Skip update_watcher because watcher has been already updated by other inotify event path="/var/log/containers/...`
Any suggestion to mitigate this? @ashie @daipom
Maybe setting `pos_file_compaction_interval` to some value, so we would stop watching rotated files (of pods that no longer exist)?
BTW, any good alternatives to Fluentd? It does not seem reliable at the moment. This issue has remained unfixed for years.
I've observed this issue with v1.16.3. I suspect that in_tail's handling of file rotations is unlikely to ever reach a satisfactory level of reliability, and something like the docker fluentd logging driver (which unfortunately breaks `kubectl logs`) is the only way to reliably avoid problems with log file rotation.
I suspect that my `open_on_every_update true` config might be making problems worse (not to mention that it introduces a race to collect log entries during log rotation), so I'm going to try disabling that to see if it helps.
Setting `open_on_every_update false` did not seem to help for me. Interestingly, the issue did not appear to correlate with a log rotation, since the log file that it was not following was still older than the fluentd process. Fluentd was following the file through two levels of symlinks, and the symlink timestamps are older than the actual log file, so I don't know how to explain why it triggered the "unreadable" log message.
Thanks for reporting.
> We are still seeing this issue as well. We are using this image v1.16.3-debian-forward-1.0.
> We were seeing the following message, `If you keep getting this message, please restart Fluentd`, and as suggested here we changed `follow_inodes` to `true` and set `rotate_wait` to `0`, but we are still seeing loads of `Skip update_watcher because watcher has been already updated by other inotify event`.
> We also noticed a pattern of memory leaking and gradual increase in CPU usage until a restart occurs. We are using fluentd as a daemonset on a kubernetes cluster. (`in_tail` config quoted above.)
So, there are still problems with both `follow_inodes false` and `follow_inodes true`, such as collection stopping and resource leaks.
On the other hand, it might not be related to the additional reports above, because #4326 says no error log is found.
BTW, it would be better to open a new issue to handle the remaining problems. This issue is already very long and mixes various causes. It seems hard to collect and track the information needed to debug the remaining causes here.
@ashie @daipom Thanks for the responses and attention.
Should I open a new issue for that?
If you have any suggestions, I'll be more than happy to test different configs.
I added `pos_file_compaction_interval 20m` to the config but am still seeing Fluentd leaking CPU:
```
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  follow_inodes true
  rotate_wait 0
  exclude_path ["/var/log/containers/fluentd*.log", "/var/log/containers/*kube-system*.log", "/var/log/containers/*calico-system*.log", "/var/log/containers/prometheus-node-exporter*.log", "/var/log/containers/opentelemetry-agent*.log"]
  pos_file_compaction_interval 20m
  <parse>
    @type multi_format
    <pattern>
      format json
      time_key time
      time_type string
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      keep_time_key true
    </pattern>
    <pattern>
      format /^(?<time>.+?) (?<stream>stdout|stderr) (?<logtag>[FP]) (?<log>.+)$/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </pattern>
  </parse>
  emit_unmatched_lines true
</source>
```
@uristernik Thanks. Could you please open a new issue?
@daipom Done, hopefully I described the issue clearly enough. Please correct me if I wasn't accurate.
@uristernik Thanks! I will check it!
If you all still have a similar issue, please report it on the new issue!
Describe the bug
After a warning about an "unreadable" file (likely due to rotation), no more logs were pushed (in_tail + pos_file). Reloading the config or restarting fluentd sorts the issue. All other existing files being tracked continued to work as expected.
To Reproduce
Not able to reproduce at will.
Expected behavior
Logs to be pushed as usual after file rotation as fluentd recovers from the temporary "unreadable" file.
Your Environment
Your Configuration
Your Error Log
Additional context
This issue seems to be related to #3586, but unfortunately I didn't check the pos file while the issue was happening, so I can't tell if it presented unexpected values for the failing file.