fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0
12.9k stars 1.34k forks source link

Fluentd agent completely stuck due to printing "pattern not match warning" #3949

Open kashok7474 opened 1 year ago

kashok7474 commented 1 year ago

Describe the bug

Hi Team,

We observed that when Fluentd agent is overwhelmed with "pattern not matched:" warning after some time it gets stuck and stopped processing. Both tailing and flushing buffers are stuck.

To Reproduce

To reproduce this issue we run an experiment on shared configuration.

Case1:
Fluentd agent running fine and able to process incoming logs file heartbeat.log.
Case 2:
In addition to case 1, we started appending logs to file pattern_not_match.log which will result in pattern not match warning and overwhelmed files with new logs. After some time agent got stuck and stopped processing logs from the heartbeat.log file.
we stopped adding log line to pattern_not_match.log but the agent didn't recover from the stuck state.
Case3: manually commented https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/in_tail.rb#L640 to stop printing pattern not match warning, and restarted agent. and repeated case 2. The agent was running fine and pattern not warning not impacting fluentd now.

For each case, the sig dump is attached.

Screenshot 2022-11-07 at 3 50 38 PM

Case 1 : 6:55 AM to 8:10AM Case 2: 8:10 AM to 9:15AM case 3 : 9:15 AM to continue

Sigdump atteched below. case 1

[root@kashokum-downloader tmp]# ps -ef | grep unified
root      2534     1  0 06:40 ?        00:00:00 /opt/unified-monitoring-agent/embedded/bin/ruby /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10
root      2537  2534  0 06:40 ?        00:00:02 /opt/unified-monitoring-agent/embedded/bin/ruby -Eascii-8bit:ascii-8bit /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10 --under-supervisor

sigdump-2534.log

sigdump-2537.log

case 2

[root@kashokum-downloader opt]# ps -ef | grep unified
root     22200     1  0 08:10 ?        00:00:00 /opt/unified-monitoring-agent/embedded/bin/ruby /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10
root     22203 22200 39 08:10 ?        00:21:00 /opt/unified-monitoring-agent/embedded/bin/ruby -Eascii-8bit:ascii-8bit /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10 --under-supervisor

sigdump-22200.log sigdump-22203.log

case 3

[root@kashokum-downloader tmp]# ps -ef | grep unified
root      3975     1  0 09:15 ?        00:00:00 /opt/unified-monitoring-agent/embedded/bin/ruby /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10
root      3978  3975  8 09:15 ?        00:05:46 /opt/unified-monitoring-agent/embedded/bin/ruby -Eascii-8bit:ascii-8bit /opt/unified-monitoring-agent/embedded/bin/fluentd --log /var/log/unified-monitoring-agent/unified-monitoring-agent.log --daemon /var/run/unified-monitoring-agent/unified-monitoring-agent.pid --log-rotate-size 1048576 --log-rotate-age 10 --under-supervisor

sigdump-3975.log sigdump-3978.log

Expected behavior

The agent should not go into a stuck state. Why print log line with pattern not match warning resulting in this issue? @ashie can you look into this? https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/in_tail.rb#L640

Your Environment

- Fluentd version: 1.14.4
- TD Agent version:
- Operating system: Linux
- Kernel version:

Your Configuration

<source>
     @type tail
     tag 6c6561.ashoklinux2heartbeat
     path  /var/log/heartbeat/hosts/*/heartbeat.log
     pos_file  /etc/unifiedmonitoringagent/pos/6c6561-ashoklinux2heartbeat.pos
     path_key  tailed_path
     <parse>
         @type none
         types  {}
         keep_time_key  false
         null_empty_string  false
         estimate_current_event  true
         timeout  0
     </parse>
 </source>

 <match 6c6561.**>
     @type xxx-xxx
     log_object_id  testid
     <buffer tag>
         @type file
         retry_timeout  3h
         path  /opt/unifiedmonitoringagent/run/buffer/6c6561
         disable_chunk_backup  true
         chunk_limit_size  5MB
         flush_interval  60s
         total_limit_size  1GB
         overflow_action  throw_exception
         retry_type  exponential_backoff
     </buffer>
 </match>
 <source>
     @type tail
     tag 687861.patternnotmatch
     path  /var/log/pattern_not_match/pattern_not_match.log
     pos_file  /etc/unifiedmonitoringagent/pos/687861-patternnotmatch.pos
     path_key  tailed_path
     <parse>
         @type none
         types  {}
         keep_time_key  false
         null_empty_string  false
         estimate_current_event  false
         timeout  0
     </parse>
 </source>

 <match 687861.**>
     @type oci_logging
     log_object_id  ocid1.log.oc1.phx.amaaaaaa4p7m3syaoyyu65n7x5l6tgcjd3bls74vhlwl6pumxooizrax3g3q
     <buffer tag>
         @type file
         retry_timeout  3h
         path  /opt/unifiedmonitoringagent/run/buffer/687861
         disable_chunk_backup  true
         chunk_limit_size  5MB
         flush_interval  180s
         total_limit_size  1GB
         overflow_action  throw_exception
         retry_type  exponential_backoff
     </buffer>
 </match>

Your Error Log

No error log, sigdump attached

Additional context

No response

fujimotos commented 1 year ago

I don't fully trace the code path yet, but I think you can avoid the particular issue by removing timeout parameter here:

 <parse>
     @type none
    ...
     timeout  0
 </parse>

This seems to be a bug of Fluentd (or maybe `ServeEngine::BlockingFlag). Evidently a parser exception can cause a deadlock here:

https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/parser.rb#L43

kashok7474 commented 1 year ago

@fujimotos thanks for the update, i will try removing timeout 0 but printing record with pattern not match warning causing this issue.