Losing logs from fluent-bit to fluentd cluster during brief outages on the fluentd cluster

dashe-ops commented 1 year ago

Describe the bug

Hi,

As a log shipping solution, we are using fluent-bit on client VMs and sending logs to a 3-node fluentd cluster.

If I stop all 3 nodes in the fluentd cluster at the same time, for example for 2 minutes, and then restart fluentd on all 3 nodes, the shipped logs are missing about 60 seconds of data from the 2-minute offline period.

To Reproduce

Write a simple test log on the client that prints the date every second:

while sleep 1; do date; done > /tmp/test.log

On the fluentd cluster, stop all 3 nodes at the same time.

Untar the test log file and note the timestamp of the last log line.

Wait 2 minutes and restart the fluentd cluster.

Wait for a new log from the client to appear, then untar it and read the first few lines.

If the buffers worked as we expect, there should be no lost data.

Every time, there is lost data:

tail -5 ie1-abc01b-nxt.nxt.test-test_20230323_02a.log

Thu Mar 23 15:56:51 UTC 2023
Thu Mar 23 15:56:52 UTC 2023
Thu Mar 23 15:56:53 UTC 2023
Thu Mar 23 15:56:54 UTC 2023
Thu Mar 23 15:56:55 UTC 2023

head ie1-abc01b-nxt.nxt.test-test_20230323_03a.log

Thu Mar 23 15:57:57 UTC 2023
Thu Mar 23 15:57:58 UTC 2023
Thu Mar 23 15:57:59 UTC 2023
Thu Mar 23 15:58:00 UTC 2023
Thu Mar 23 15:58:01 UTC 2023

In the above example we've lost about 1 minute of data: everything between 15:56:55 and 15:57:57 is missing.

Expected behavior

If the buffers worked as we expect, there should be no lost data.

Your Environment

- Fluentd version: 4.4.2
- TD Agent version: 2.0.6
- Operating system: CentOS 7
- Kernel version: 3.10.0-1160.83.1.el7.x86_64

Your Configuration

Client fluent-bit configuration:
logship-fluent-bit.conf

[SERVICE]
    # Flush
    # =====
    # set an interval of seconds before to flush records to a destination
    flush        30
[INPUT]
        name tail
        path /tmp/test.log
        path_key log_file
        tag i2.2y.default.sgb.${HOSTNAME}.<filename>
        tag_regex (\/.*\/)(?<filename>.+)
        Storage.type memory
        DB /var/log/logship/buffer/tail-0.db
        DB.locking true
        DB.journal_mode WAL
[OUTPUT]
        Name forward
        Match *
        Host ie1-logship-nxt.nxt.endpoint
        Port 80
        Compress gzip

And the config on the fluentd cluster:
<system>
    workers 1
    rpc_endpoint 0.0.0.0:24724
</system>

<source>
    @type forward
    port 24224
    @id forward
</source>

<match i2**>
    @type file
    @id file
    compress gzip
    path /data/${tag[1]}/%Y/%m/%d/${tag[2]}/${tag[3]}/${tag[4]}.${tag[5]}.${tag[6]}-${tag[7]}_%Y%m%d_03a
    append
    <buffer tag,time>
        @type memory
        flush_thread_count 8
        chunk_limit_size 8M
        queue_limit_length 64
        retry_max_interval 30
        retry_max_times 1000
        flush_mode interval
        flush_interval 30s
    </buffer>
    <format>
        @type single_value
        message_key log
    </format>
</match>

<source>
@type monitor_agent
bind 0.0.0.0
port 24220
@id monitor_agent
</source>

Your Error Log

No errors in the logs, just missing data.

Additional context

No response

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

Garfield96 commented 1 year ago

Hi @dashe-ops, I assume the issue is caused by your Fluent-Bit configuration. You can set the Require_ack_response option in the forward output (https://docs.fluentbit.io/manual/pipeline/outputs/forward) for improved reliability. In addition, you should make sure that Fluent-Bit doesn't drop log messages because the maximum number of retries was reached. The configuration of retries is described here https://docs.fluentbit.io/manual/administration/scheduling-and-retries#configuring-retries.
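
As a rough sketch of the two suggestions above, assuming the classic-mode config from the report, the forward output could look like the fragment below. Host, Port and Compress are copied from the original post; Require_ack_response and Retry_Limit are the only additions. The value no_limits is just one possible choice (a finite number sized to the longest expected outage would also work), and whether this alone closes the gap has not been confirmed in this thread.

```
[OUTPUT]
    Name                  forward
    Match                 *
    Host                  ie1-logship-nxt.nxt.endpoint
    Port                  80
    Compress              gzip
    # Assumption: wait for the receiver to acknowledge each chunk
    # before fluent-bit treats it as delivered.
    Require_ack_response  true
    # Assumption: keep retrying failed chunks instead of dropping them
    # once the retry limit is reached.
    Retry_Limit           no_limits
```

According to the scheduling documentation linked above, the default Retry_Limit is 1, so a chunk that fails its single retry is discarded. That would be consistent with losing part of a 2-minute outage while fluentd itself logs no error. Note that with Storage.type memory on the input, unlimited retries keep the pending chunks in memory for as long as the cluster is down.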
