aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image

Sporadic failures when attempting to send log events via cloudwatch_logs #872

Open amy-hashemi opened 1 week ago

amy-hashemi commented 1 week ago
### Describe the question/issue

Fluent Bit is sporadically failing to send log events to CloudWatch, reporting a variety of TLS and http_client errors. When generating logs, Fluent Bit will successfully send the first few collected chunks to CloudWatch, but will then store chunks in memory and hang for a seemingly random amount of time while it retries sending those chunks. I've set up Fluent Bit to retry indefinitely so that logs are not lost. Each client running Fluent Bit sends to a dedicated log stream (I have 10 log streams within a single group, with each stream receiving logs from a different client).

### Configuration

Here's my configuration:

[SERVICE]
    flush 5
    daemon  on
    log_level trace
    log_file C:\logStorage\fluent-bit
    Parsers_File parsers.conf
    plugins_file plugins.conf
    http_server  Off
    http_listen  0.0.0.0
    http_port    2020
    storage.metrics on
    storage.path C:\logStorage
    storage.sync normal
    storage.checksum off
    storage.max_chunks_up 128
    storage.backlog.mem_limit 5M

[INPUT]
    Name tail
    Path C:\Program Files (x86)\LogFiles\*_Trace.trc
    Read_from_Head True
    Refresh_Interval 5
    Parser inputLog
    Path_Key path
    Tag inputLog
    storage.type filesystem
    Skip_Long_lines on
    Ignore_Older 3d
    DB tail.db

[FILTER]
    Name record_modifier
    Match *
    Record hostname ${HOSTNAME}

[OUTPUT]
    name cloudwatch_logs
    match *
    log_group_name /la/logs
    log_stream_name ${HOSTNAME}
    Retry_Limit False
    region us-west-2

### Fluent Bit Log Output

I see a few possible outcomes when log events are sent to CloudWatch:

The first chunk usually has no problem:

[2024/10/29 15:10:37] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=143, payload=54352 bytes
[2024/10/29 15:10:37] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sending log events to log stream AB-PW01XJJZ
[2024/10/29 15:10:37] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is connected
[2024/10/29 15:10:37] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:10:37] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:10:37] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] PutLogEvents http status=200
[2024/10/29 15:10:37] [debug] [task] destroy task=00000274BDCE4CE0 (task_id=2)
[2024/10/29 15:10:37] [debug] [out flush] cb_destroy coro_id=15
[2024/10/29 15:10:42] [debug] [input:tail:tail.0] 0 new files found on path 'C:\Program Files (x86)\LogFiles\*_Trace.trc'

Then, as more data is collected, Fluent Bit starts struggling to send events:

[2024/10/29 15:10:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:10:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=20, payload=6696 bytes
[2024/10/29 15:10:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sending log events to log stream AB-PW01XJJZ
[2024/10/29 15:10:57] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:10:57] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:10:57] [error] [http_client] broken connection to logs.us-west-2.amazonaws.com:443 ?
[2024/10/29 15:10:57] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=0, HTTP Status: 0
[2024/10/29 15:10:57] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:10:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] PutLogEvents http status=0
[2024/10/29 15:10:57] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2024/10/29 15:10:57] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2024/10/29 15:10:57] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2024/10/29 15:10:57] [debug] [out flush] cb_destroy coro_id=16
[2024/10/29 15:11:02] [debug] [input:tail:tail.0] 0 new files found on path 'C:\Program Files (x86)\LogFiles\*_Trace.trc'
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] task_id=2 assigned to thread #0
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:11:06] [debug] [retry] re-using retry for task_id=2 attempts=2
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:11:06] [ warn] [engine] failed to flush chunk '26244-1730239853.255321700.flb', retry in 8 seconds: task_id=2, input=tail.0 > output=cloudwatch_logs.0 (out_id=0)

The chunk that Fluent Bit tried to send above sits in the directory where I'm storing chunks (filesystem used as a buffer) until Fluent Bit finally retries enough times to send it. Sometimes it needs upwards of 9 or 10 retries, and the retry counter climbs into the 60s when I have many chunks in my filesystem buffer.
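As an aside, since the backoff between those retries grows quickly, one thing I'm looking at is the retry scheduler settings in the [SERVICE] section. This is only a sketch with illustrative values (not part of my current configuration), using the documented scheduler.base/scheduler.cap options that bound the retry wait window:

[SERVICE]
    # Illustrative values only -- not what we currently run.
    # scheduler.base is the base (in seconds) of the exponential backoff,
    # scheduler.cap is the maximum wait (in seconds) between retries.
    scheduler.base 3
    scheduler.cap  30

Capping the wait wouldn't fix the underlying errors, but it should keep a long backlog from sitting idle between attempts.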

Then it will retry, either with the same chunk or a different one, and I'll get a message like this:

[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=20, payload=6696 bytes
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sending log events to log stream AB-PW01XJJZ
[2024/10/29 15:11:06] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:11:06] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:11:06] [error] [tls] unknown error
[2024/10/29 15:11:06] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2024/10/29 15:11:06] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:11:06] [debug] [aws_client] auto-retrying
[2024/10/29 15:11:06] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:11:06] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:11:06] [error] [tls] unknown error
[2024/10/29 15:11:06] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2024/10/29 15:11:06] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:11:06] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2024/10/29 15:11:06] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2024/10/29 15:11:06] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2024/10/29 15:11:06] [debug] [out flush] cb_destroy coro_id=17
[2024/10/29 15:11:07] [debug] [input:tail:tail.0] 0 new files found on path 'C:\Program Files (x86)\LogFiles\*_Trace.trc'

And then something like this:

[2024/10/29 15:11:14] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:11:14] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:11:14] [error] [tls] error: error:0A00007F:SSL routines::bad write retry
[2024/10/29 15:11:14] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2024/10/29 15:11:14] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:11:14] [debug] [aws_client] auto-retrying
[2024/10/29 15:11:14] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:11:14] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:11:14] [error] [tls] unknown error

And then finally it will succeed:

[2024/10/29 15:13:14] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sending log events to log stream AB-PW01XJJZ
[2024/10/29 15:13:14] [debug] [upstream] KA connection #1208 to logs.us-west-2.amazonaws.com:443 is connected
[2024/10/29 15:13:14] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:13:14] [debug] [upstream] KA connection #1208 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:13:14] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] PutLogEvents http status=200
[2024/10/29 15:13:14] [ info] [engine] flush chunk '26244-1730239561.745338800.flb' succeeded at retry 7: task_id=0, input=tail.0 > output=cloudwatch_logs.0 (out_id=0)
[2024/10/29 15:13:14] [debug] [task] destroy task=00000274BDCE65E0 (task_id=0)
[2024/10/29 15:13:14] [debug] [out flush] cb_destroy coro_id=23

There doesn't seem to be any rhyme or reason to why it succeeds sometimes and fails other times. The result is a high number of chunks buffered on the filesystem that are not flushed to CloudWatch fast enough; logs buffered in the morning of one day still haven't been flushed by the time I check the log stream the following day.
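To quantify that backlog, my plan is to turn the built-in HTTP server back on (it's Off in the configuration above) and poll the storage endpoint that becomes available when storage.metrics is on. A rough sketch:

[SERVICE]
    http_server     On
    http_listen     0.0.0.0
    http_port       2020
    storage.metrics on

Then, from the same host:

    curl http://127.0.0.1:2020/api/v1/storage

which reports how many chunks are currently up in memory versus down on the filesystem, per input.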

Some other errors we see on our production instruments running Fluent Bit are below; they seem to cause Fluent Bit to fail when flushing .flb chunks buffered on disk. The chunks range in size from 4 KB to upwards of 150 KB, so the failures don't seem to depend on chunk size.

The errors below also seem to show up erratically:

[tls] error: error:00000006:lib(0)::reason(6)

[tls] error: error:00000001:lib(0)::reason(1)
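One pattern worth noting: every failed request in the logs above goes out on a connection the upstream reports as "has been assigned (recycled)", while the successful ones go out on a freshly connected socket. So an experiment I'm considering is disabling, or at least tightening, keep-alive on the output via the generic net.* options. This is only a sketch, and I have not confirmed it changes anything:

[OUTPUT]
    name cloudwatch_logs
    match *
    region us-west-2
    log_group_name /la/logs
    log_stream_name ${HOSTNAME}
    Retry_Limit False
    # Experiment only -- not confirmed as a fix.
    net.keepalive off
    # Alternatively, keep keep-alive on but recycle connections sooner:
    # net.keepalive_idle_timeout 10
    # net.keepalive_max_recycle  100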

### Fluent Bit Version Info

We started on Fluent Bit version 2.2.2, but I recently upgraded to version 3.1.9 with no improvement in behavior.

### Cluster Details

IPv4 protocol
OS: Windows x64
I don't think throttling should be an issue, as each client is making calls to an individual log stream.

### Application Details

Sometimes we see around 10 lines per second, while at other times we might reach up to 60 lines per second. Log lines range between ~100 and ~400 bytes.
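For scale, a rough worst-case estimate from the numbers above:

    60 lines/s x ~400 bytes/line ≈ 24 KB/s per client
    24 KB/s x 10 clients         ≈ 240 KB/s across all streams

so raw log volume seems unlikely to be the bottleneck.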

### Steps to reproduce issue

### Related Issues

svnaresh commented 4 days ago

We are also facing the same issue with Fluent Bit, with errors matching the logs above.