fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

[1.7-dev] Filter plugin failed the input source #3065

Closed sophieyfang closed 3 years ago

sophieyfang commented 3 years ago

We use three filters to tweak the log tag name: syslog for Linux and eventlog for Windows. I just found out that none of the logs from that source (syslog or eventlog) get ingested when those three filters are present. This conf was working before. I turned log_level up to debug, and still the only error message is "[2021/02/11 19:19:01] [error] [input chunk] no available chunk". However, removing those three filters from the conf brings the logs back.

[FILTER]
    Name  modify
    Match default_pipeline.syslog
    Add   logName syslog

[FILTER]
    Name                  rewrite_tag
    Match                 default_pipeline.syslog
    Rule                  $logName .* $logName false
    Emitter_Storage.type  filesystem
    Emitter_Mem_Buf_Limit 10M

[FILTER]
    Name   modify
    Match  syslog
    Remove logName

Reproduce

Build the agent from the master branch (currently 9c35c28f) or v1.7.0-rc5.

Use this configuration (on any Linux distro; CentOS 8 is the one I use):

[SERVICE]
    # https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file#config_section
    # Flush logs every 1 second, even if the buffer is not full to minimize log entry arrival delay.
    Flush      1
    # We use systemd to manage Fluent Bit instead.
    Daemon     off
    # Log_File is set by Fluent Bit systemd unit (e.g. /var/log/google-cloud-ops-agent/subagents/logging-module.log).
    Log_Level  info

    # https://docs.fluentbit.io/manual/administration/monitoring
    # Enable a built-in HTTP server that can be used to query internal information and monitor metrics of each running plugin.
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

    # https://docs.fluentbit.io/manual/administration/buffering-and-storage#service-section-configuration
    # storage.path is set by Fluent Bit systemd unit (e.g. /var/lib/google-cloud-ops-agent/fluent-bit/buffers).
    storage.sync               normal
    # Enable the data integrity check when writing and reading data from the filesystem.
    storage.checksum           on
    # The maximum amount of data to load into the memory when processing old chunks from the backlog that is from previous Fluent Bit processes (e.g. Fluent Bit may have crashed or restarted).
    storage.backlog.mem_limit  50M
    # Enable storage metrics in the built-in HTTP server.
    storage.metrics            on
    # This is exclusive to filesystem storage type. It specifies the number of chunks (every chunk is a file) that can be up in memory.
    # Every chunk is a file, so having it up in memory means having an open file descriptor. In case there are thousands of chunks,
    # we don't want them to all be loaded into the memory.
    storage.max_chunks_up      128

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/tail#config
    Name               tail
    DB                 /var/lib/google-cloud-ops-agent/fluent-bit/buffers/ops-agent-fluent-bit
    Path               /var/log/google-cloud-ops-agent/subagents/logging-module.log
    Tag                ops-agent-fluent-bit
    # Set the chunk limit conservatively to avoid exceeding the recommended chunk size of 5MB per write request.
    Buffer_Chunk_Size  512k
    # Set the max size a bit larger to accommodate for long log lines.
    Buffer_Max_Size    5M
    # When a message is unstructured (no parser applied), append it under a key named "message".
    Key                message
    # Increase this to 30 seconds so log rotations are handled more gracefully.
    Rotate_Wait        30
    # Skip long lines instead of skipping the entire file when a long line exceeds buffer size.
    Skip_Long_Lines    On

    # https://docs.fluentbit.io/manual/administration/buffering-and-storage#input-section-configuration
    # Buffer in disk to improve reliability.
    storage.type       filesystem

    # https://docs.fluentbit.io/manual/administration/backpressure#mem_buf_limit
    # This controls how much data the input plugin can hold in memory once the data is ingested into the core.
    # This is used to deal with backpressure scenarios (e.g: cannot flush data for some reason).
    # When the input plugin hits "mem_buf_limit", because we have enabled filesystem storage type, mem_buf_limit acts
    # as a hint to set "how much data can be up in memory", once the limit is reached it continues writing to disk.
    Mem_Buf_Limit      10M

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/tail#config
    Name               tail
    DB                 /var/lib/google-cloud-ops-agent/fluent-bit/buffers/ops-agent-collectd
    Path               /var/log/google-cloud-ops-agent/subagents/metrics-module.log
    Tag                ops-agent-collectd
    # Set the chunk limit conservatively to avoid exceeding the recommended chunk size of 5MB per write request.
    Buffer_Chunk_Size  512k
    # Set the max size a bit larger to accommodate for long log lines.
    Buffer_Max_Size    5M
    # When a message is unstructured (no parser applied), append it under a key named "message".
    Key                message
    # Increase this to 30 seconds so log rotations are handled more gracefully.
    Rotate_Wait        30
    # Skip long lines instead of skipping the entire file when a long line exceeds buffer size.
    Skip_Long_Lines    On

    # https://docs.fluentbit.io/manual/administration/buffering-and-storage#input-section-configuration
    # Buffer in disk to improve reliability.
    storage.type       filesystem

    # https://docs.fluentbit.io/manual/administration/backpressure#mem_buf_limit
    # This controls how much data the input plugin can hold in memory once the data is ingested into the core.
    # This is used to deal with backpressure scenarios (e.g: cannot flush data for some reason).
    # When the input plugin hits "mem_buf_limit", because we have enabled filesystem storage type, mem_buf_limit acts
    # as a hint to set "how much data can be up in memory", once the limit is reached it continues writing to disk.
    Mem_Buf_Limit      10M

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/tail#config
    Name               tail
    DB                 /var/lib/google-cloud-ops-agent/fluent-bit/buffers/default_pipeline_syslog
    Path               /var/log/messages,/var/log/syslog
    Tag                default_pipeline.syslog
    # Set the chunk limit conservatively to avoid exceeding the recommended chunk size of 5MB per write request.
    Buffer_Chunk_Size  512k
    # Set the max size a bit larger to accommodate for long log lines.
    Buffer_Max_Size    5M
    # When a message is unstructured (no parser applied), append it under a key named "message".
    Key                message
    # Increase this to 30 seconds so log rotations are handled more gracefully.
    Rotate_Wait        30
    # Skip long lines instead of skipping the entire file when a long line exceeds buffer size.
    Skip_Long_Lines    On

    # https://docs.fluentbit.io/manual/administration/buffering-and-storage#input-section-configuration
    # Buffer in disk to improve reliability.
    storage.type       filesystem

    # https://docs.fluentbit.io/manual/administration/backpressure#mem_buf_limit
    # This controls how much data the input plugin can hold in memory once the data is ingested into the core.
    # This is used to deal with backpressure scenarios (e.g: cannot flush data for some reason).
    # When the input plugin hits "mem_buf_limit", because we have enabled filesystem storage type, mem_buf_limit acts
    # as a hint to set "how much data can be up in memory", once the limit is reached it continues writing to disk.
    Mem_Buf_Limit      10M

[FILTER]
    Name  modify
    Match default_pipeline.syslog
    Add   logName syslog

[FILTER]
    Name                  rewrite_tag
    Match                 default_pipeline.syslog
    Rule                  $logName .* $logName false
    Emitter_Storage.type  filesystem
    Emitter_Mem_Buf_Limit 10M

[FILTER]
    Name   modify
    Match  syslog
    Remove logName

[OUTPUT]
    # https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
    Name     stackdriver
    resource gce_instance
    Match    ops-agent-fluent-bit

    # https://docs.fluentbit.io/manual/administration/scheduling-and-retries
    # After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
    Retry_Limit  3

    # https://docs.fluentbit.io/manual/administration/security
    # Enable TLS support.
    tls         On
    # Do not force certificate validation.
    tls.verify  Off

[OUTPUT]
    # https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
    Name     stackdriver
    resource gce_instance
    Match    ops-agent-collectd

    # https://docs.fluentbit.io/manual/administration/scheduling-and-retries
    # After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
    Retry_Limit  3

    # https://docs.fluentbit.io/manual/administration/security
    # Enable TLS support.
    tls         On
    # Do not force certificate validation.
    tls.verify  Off

[OUTPUT]
    # https://docs.fluentbit.io/manual/pipeline/outputs/stackdriver
    Name     stackdriver
    resource gce_instance
    Match    syslog

    # https://docs.fluentbit.io/manual/administration/scheduling-and-retries
    # After 3 retries, a given chunk will be discarded. So bad entries don't accidentally stay around forever.
    Retry_Limit  3

    # https://docs.fluentbit.io/manual/administration/security
    # Enable TLS support.
    tls         On
    # Do not force certificate validation.
    tls.verify  Off

edsiper commented 3 years ago

Config Viewer


permalink: https://link.calyptia.com/ax4

Troubleshooting

Looking at the filter that uses rewrite_tag: the newly emitted records will have the same tag, syslog, so they will continuously match the rewrite rule. Note that when rewrite_tag emits a record, that record goes back through the beginning of the pipeline, so it's suggested that your new tag have a different prefix, e.g.:

[FILTER]
    Name                  rewrite_tag
    Match                 default_pipeline.syslog
    Rule                  $logName .* new.$logName false
    Emitter_Storage.type  filesystem
    Emitter_Mem_Buf_Limit 10M

[FILTER]
    Name   modify
    Match  new.syslog
    Remove logName

sophieyfang commented 3 years ago

After going through the rewrite_tag plugin, the tag name is changed from default_pipeline.syslog to syslog. So why would it go through the rewrite_tag plugin again?

The expected final log tags on the client side shall be syslog and windows_event_log. Also, we use those confs and they were working before.

edsiper commented 3 years ago

sorry about that. I was able to reproduce it with a minimized config:

[SERVICE]
    flush     1
    log_level info

[INPUT]
    name  dummy
    dummy {"key": "test"}
    tag   default_pipeline.syslog

[FILTER]
    Name  modify
    Match default_pipeline.syslog
    Add   logName syslog

[FILTER]
    name  stdout
    match *

[FILTER]
    Name                  rewrite_tag
    Match                 default_pipeline.syslog
    Rule                  $logName .* $logName false

[FILTER]
    Name   modify
    Match  syslog
    Remove logName

[OUTPUT]
    name   stdout
    match  syslog

the errors:

$ ../../build/bin/fluent-bit -c conf 
Fluent Bit v1.7.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/02/11 15:14:51] [ info] [engine] started (pid=1187409)
[2021/02/11 15:14:51] [ info] [storage] version=1.1.0, initializing...
[2021/02/11 15:14:51] [ info] [storage] in-memory
[2021/02/11 15:14:51] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/02/11 15:14:51] [ info] [sp] stream processor started
[2021/02/11 15:14:51] [error] [input chunk] no available chunk
[2021/02/11 15:14:52] [error] [input chunk] no available chunk

sophieyfang commented 3 years ago

I looked at the code in https://github.com/fluent/fluent-bit/tree/master/plugins/filter_rewrite_tag and https://github.com/fluent/fluent-bit/tree/master/plugins/filter_modify. It seems there have been no changes since last year.. maybe the order in which those filters are executed changed recently.. 🤷‍♀️

Thank you Eduardo. Much appreciated for picking up this bug!

edsiper commented 3 years ago

This issue was introduced by another PR.

@JeffLuoo can you help us troubleshoot this issue? It is a blocker for the release and was introduced in this commit:

https://github.com/fluent/fluent-bit/commit/44e9e9df79e55ba6f078849a20cff1fa06b27734

This is the line triggering the problem:

https://github.com/fluent/fluent-bit/blob/master/src/flb_input_chunk.c#L561

JeffLuoo commented 3 years ago

@edsiper I will take a look.

edsiper commented 3 years ago

@sophieyfang I've pushed a temporary branch that reverts #2882

@JeffLuoo if you can get this sorted out, that would be awesome :)

edsiper commented 3 years ago

adding @jlunavtgrad for visibility

JeffLuoo commented 3 years ago

Some updates... After adding some printouts to the code, I found that the new tag is not passed to the function https://github.com/fluent/fluent-bit/blob/e34f061396f95955af828049feaa0baedfb60a47/src/flb_routes_mask.c#L32

* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/02/11 22:56:31] [ info] [engine] started (pid=3756777)
[2021/02/11 22:56:31] [ info] [storage] version=1.1.0, initializing...
[2021/02/11 22:56:31] [ info] [storage] in-memory
[2021/02/11 22:56:31] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/02/11 22:56:31] [ info] [sp] stream processor started
[2021/02/11 22:56:31] [ info] The tag now is default_pipeline.syslog
[2021/02/11 22:56:31] [ info] [input chunk] no matching route for input chunk '3756777-1613084191.882263195.flb' with tag 'default_pipeline.syslog'
[2021/02/11 22:56:31] [ info] flb_routes_mask_is_empty returns 1
[2021/02/11 22:56:31] [error] [input chunk] no available chunk

jlunavtgrad commented 3 years ago

I'm not sure if my changes are the root cause. I backdated my build to Feb 1st, before my changes were merged, and I wasn't able to get tag rewriting to work.

fujimotos commented 3 years ago

FWIW, I think @JeffLuoo is on the right track. The input plugin seems to be discarding data because it thinks there is no corresponding output plugin for these chunks.

Evidently it does not notice the existence of out_stackdriver, which is listening to the stream through filter_rewrite_tag.

I posted a hot fix #3071 which essentially brings back the conditional statement that was removed here. There may well be a wiser fix, but I can confirm that this patch resolves the failure case @sophieyfang reported.

edsiper commented 3 years ago

Fixed in #3071