### Describe the question/issue
Fluent Bit is sporadically failing to send log events to CloudWatch, reporting a variety of TLS and http_client errors. When logs are being generated, Fluent Bit successfully sends the first few collected chunks to CloudWatch, but then stores subsequent chunks in memory and hangs for a seemingly random amount of time while it retries sending them.
I've set up Fluent Bit to retry indefinitely so that no logs are lost. Each client running Fluent Bit writes to a dedicated log stream (10 log streams within a single group, each receiving logs from a different client).
### Configuration
Here's my configuration:
[SERVICE]
flush 5
daemon on
log_level trace
log_file C:\logStorage\fluent-bit
Parsers_File parsers.conf
plugins_file plugins.conf
http_server Off
http_listen 0.0.0.0
http_port 2020
storage.metrics on
storage.path C:\logStorage
storage.sync normal
storage.checksum off
storage.max_chunks_up 128
storage.backlog.mem_limit 5M
[INPUT]
Name tail
Path C:\Program Files (x86)\LogFiles\*_Trace.trc
Read_from_Head True
Refresh_Interval 5
Parser inputLog
Path_Key path
Tag inputLog
storage.type filesystem
Skip_Long_Lines on
Ignore_Older 3d
DB tail.db
[FILTER]
Name record_modifier
Match *
Record hostname ${HOSTNAME}
[OUTPUT]
name cloudwatch_logs
match *
log_group_name /la/logs
log_stream_name ${HOSTNAME}
Retry_Limit False
region us-west-2
### Fluent Bit Log Output
There are a few possible outcomes when Fluent Bit attempts to send logs to CloudWatch.
The first chunk usually goes through without a problem:
[2024/10/29 15:10:37] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=143, payload=54352 bytes
[2024/10/29 15:10:37] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sending log events to log stream AB-PW01XJJZ
[2024/10/29 15:10:37] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is connected
[2024/10/29 15:10:37] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:10:37] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:10:37] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] PutLogEvents http status=200
[2024/10/29 15:10:37] [debug] [task] destroy task=00000274BDCE4CE0 (task_id=2)
[2024/10/29 15:10:37] [debug] [out flush] cb_destroy coro_id=15
[2024/10/29 15:10:42] [debug] [input:tail:tail.0] 0 new files found on path 'C:\Program Files (x86)\LogFiles\*_Trace.trc'
Then, as more data is collected, Fluent Bit starts to struggle to send events:
[2024/10/29 15:10:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:10:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=20, payload=6696 bytes
[2024/10/29 15:10:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sending log events to log stream AB-PW01XJJZ
[2024/10/29 15:10:57] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:10:57] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:10:57] [error] [http_client] broken connection to logs.us-west-2.amazonaws.com:443 ?
[2024/10/29 15:10:57] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=0, HTTP Status: 0
[2024/10/29 15:10:57] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:10:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] PutLogEvents http status=0
[2024/10/29 15:10:57] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2024/10/29 15:10:57] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2024/10/29 15:10:57] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2024/10/29 15:10:57] [debug] [out flush] cb_destroy coro_id=16
[2024/10/29 15:11:02] [debug] [input:tail:tail.0] 0 new files found on path 'C:\Program Files (x86)\LogFiles\*_Trace.trc'
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] task_id=2 assigned to thread #0
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:11:06] [debug] [retry] re-using retry for task_id=2 attempts=2
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:11:06] [ warn] [engine] failed to flush chunk '26244-1730239853.255321700.flb', retry in 8 seconds: task_id=2, input=tail.0 > output=cloudwatch_logs.0 (out_id=0)
The chunk that Fluent Bit tried to send above sits in the directory where I'm storing chunks (the filesystem buffer) until Fluent Bit finally retries enough times to send it. Sometimes it needs upwards of 9 or 10 retries, and the retry counter climbs into the 60s when I have many chunks in the filesystem buffer.
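All of the failures above happen on a recycled keepalive connection (`KA connection #1232 ... has been assigned (recycled)`), so one mitigation I'm considering is forcing a fresh connection per request. A minimal sketch, assuming the generic `net.keepalive` upstream option applies to the cloudwatch_logs output the way it does for other outputs:

```
[OUTPUT]
    name             cloudwatch_logs
    match            *
    log_group_name   /la/logs
    log_stream_name  ${HOSTNAME}
    Retry_Limit      False
    region           us-west-2
    # Open a new TLS connection for every PutLogEvents request instead of
    # reusing (possibly stale) keepalive connections.
    net.keepalive    off
```

This trades extra TLS handshakes for never writing into a connection the server side may have already closed.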
When Fluent Bit retries, either with the same chunk or a different one, I'll get a message like this:
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Using stream=AB-PW01XJJZ, group=/la/clients
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=20, payload=6696 bytes
[2024/10/29 15:11:06] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sending log events to log stream AB-PW01XJJZ
[2024/10/29 15:11:06] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:11:06] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:11:06] [error] [tls] unknown error
[2024/10/29 15:11:06] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2024/10/29 15:11:06] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:11:06] [debug] [aws_client] auto-retrying
[2024/10/29 15:11:06] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:11:06] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:11:06] [error] [tls] unknown error
[2024/10/29 15:11:06] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2024/10/29 15:11:06] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:11:06] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2024/10/29 15:11:06] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2024/10/29 15:11:06] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2024/10/29 15:11:06] [debug] [out flush] cb_destroy coro_id=17
[2024/10/29 15:11:07] [debug] [input:tail:tail.0] 0 new files found on path 'C:\Program Files (x86)\LogFiles\*_Trace.trc'
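The `[aws_client] auto-retrying` line is the AWS client's immediate single retry of the failed request; in this trace it reuses the same recycled connection and fails again right away. If the engine's scheduler should own all retries instead, that immediate retry can be disabled. A sketch, assuming the `auto_retry_requests` option documented for the AWS outputs:

```
[OUTPUT]
    name                 cloudwatch_logs
    match                *
    # Skip the AWS client's immediate one-shot retry; failed chunks fall
    # straight through to the scheduler's backoff retries.
    auto_retry_requests  false
    # (remaining keys identical to the output section shown above)
```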
Other attempts fail with a more specific TLS error before falling back to the same unknown one:
[2024/10/29 15:11:14] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:11:14] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:11:14] [error] [tls] error: error:0A00007F:SSL routines::bad write retry
[2024/10/29 15:11:14] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2024/10/29 15:11:14] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:11:14] [debug] [aws_client] auto-retrying
[2024/10/29 15:11:14] [debug] [upstream] KA connection #1232 to logs.us-west-2.amazonaws.com:443 has been assigned (recycled)
[2024/10/29 15:11:14] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:11:14] [error] [tls] unknown error
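`[tls] unknown error` and `bad write retry` don't say much on their own, so raising the TLS log verbosity on the output should surface the underlying OpenSSL error queue. A sketch using the standard `tls.debug` level (0–4):

```
[OUTPUT]
    name       cloudwatch_logs
    match      *
    # Maximum TLS verbosity; failed handshakes and writes should then log
    # the concrete OpenSSL error instead of "unknown error".
    tls.debug  4
    # (remaining keys identical to the output section shown above)
```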
And then, finally, the flush succeeds:
[2024/10/29 15:13:14] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] Sending log events to log stream AB-PW01XJJZ
[2024/10/29 15:13:14] [debug] [upstream] KA connection #1208 to logs.us-west-2.amazonaws.com:443 is connected
[2024/10/29 15:13:14] [debug] [http_client] not using http_proxy for header
[2024/10/29 15:13:14] [debug] [upstream] KA connection #1208 to logs.us-west-2.amazonaws.com:443 is now available
[2024/10/29 15:13:14] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] PutLogEvents http status=200
[2024/10/29 15:13:14] [ info] [engine] flush chunk '26244-1730239561.745338800.flb' succeeded at retry 7: task_id=0, input=tail.0 > output=cloudwatch_logs.0 (out_id=0)
[2024/10/29 15:13:14] [debug] [task] destroy task=00000274BDCE65E0 (task_id=0)
[2024/10/29 15:13:14] [debug] [out flush] cb_destroy coro_id=23
There doesn't seem to be any rhyme or reason to why it succeeds sometimes and fails other times. The result is a high number of chunks buffered on the filesystem that are not flushed to CloudWatch fast enough. Even logs buffered one morning still haven't been flushed by the time I check the log stream the next day.
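To quantify the backlog while this reproduces, the built-in HTTP server has to be enabled (my config sets `storage.metrics on` but `http_server Off`, so those metrics aren't currently reachable), and the retry backoff can be capped so long-failing chunks aren't scheduled minutes apart. A sketch, assuming the documented `scheduler.base`/`scheduler.cap` service options:

```
[SERVICE]
    # Expose /api/v1/metrics and /api/v1/storage so chunk and retry
    # counts can be watched while the problem reproduces.
    http_server      On
    http_listen      0.0.0.0
    http_port        2020
    storage.metrics  on
    # Retry waits grow from a ~5s base but are capped at 30s per chunk,
    # instead of the default cap of 2000s.
    scheduler.base   5
    scheduler.cap    30
```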
Below are some other errors we see on our production instruments running Fluent Bit, which seem to cause flush failures for the .flb chunks buffered on disk. These chunks range from 4 KB to upwards of 150 KB, so the failures don't seem to discriminate by chunk size.
These errors also show up erratically:
### Fluent Bit Version Info
We started on Fluent Bit version 2.2.2, and I recently upgraded to version 3.1.9 with no improvement in behavior.
### Cluster Details
- IPv4 protocol
- I don't think throttling should be an issue, since each client makes calls to its own dedicated log stream.
- OS: Windows x64
### Application Details
- Throughput varies from about 10 lines per second up to about 60 lines per second.
- Log lines range between ~100 and ~400 bytes.
- Worst case, that's roughly 60 lines/s × 400 bytes ≈ 24 KB/s, or about 120 KB per 5-second flush, well under the 1 MB PutLogEvents batch limit.
### Steps to reproduce issue

### Related Issues