zvia-eiger-hs opened 1 year ago
Occasionally we also see log loss:
So as you saw in the guide, network connection issues happen occasionally; they only become a problem when they cause log loss.
How high is the throughput you are sending? Do you ever get throttle exceptions from Firehose?
As noted in that guide:
One of the simplest causes of network connection issues is throttling: some AWS APIs will block new connections from the same IP when throttling (rather than wasting effort returning a throttling error in the response). We have seen this with the CloudWatch Logs API. So, the first thing to check when you experience network connection issues is your log ingestion/throughput rate and the limits for your destination.
I notice you are using the default retry_limit; consider increasing it: https://docs.fluentbit.io/manual/administration/scheduling-and-retries
With high throughput and only the single default retry, log loss could happen occasionally.
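As a sketch, an output stanza with a higher retry limit might look like this (the stream name is a placeholder and the limit value is illustrative, not a tuned recommendation):

[OUTPUT]
    Name            kinesis_firehose
    Match           kube.*
    region          eu-central-1
    delivery_stream my-stream-name       # placeholder, use your stream
    # Integer = max retries per chunk; 'False' disables the limit entirely
    Retry_Limit     5

A higher limit only buys time; if the destination stays unreachable longer than the retries cover, the chunk is still dropped.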
@PettitWesley Thank you for replying. Throughput is at most 2.20 MiB/s, mostly around 1 MiB/s, and I do not see throttle exceptions from Firehose. We have added a retry limit that seems to cover most log loss cases, but some still happen.
[OUTPUT]
    Name            kinesis_firehose
    Match           kube.*
    region          eu-central-1
    delivery_stream production_de_logs_stream
    Retry_Limit     3
    workers         2
Adding some more bad logs:
[2023/05/17 13:09:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:09:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:09:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:09:31] [ warn] [engine] failed to flush chunk '1-1684328971.482540519.flb', retry in 6 seconds: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:09:37] [ info] [engine] flush chunk '1-1684328971.482540519.flb' succeeded at retry 1: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:25] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:25] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:13:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:13:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:13:25] [ warn] [engine] failed to flush chunk '1-1684329205.285653322.flb', retry in 6 seconds: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:31] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:31] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:13:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:13:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:13:31] [ warn] [engine] failed to flush chunk '1-1684329205.285653322.flb', retry in 15 seconds: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:37] [error] [tls] error: unexpected EOF
[2023/05/17 13:13:37] [error] [aws_client] connection initialization error
[2023/05/17 13:13:45] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:45] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2023/05/17 13:13:45] [error] [src/flb_http_client.c:1189 errno=25] Inappropriate ioctl for device
[2023/05/17 13:13:45] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:13:45] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:13:45] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:13:45] [ warn] [engine] failed to flush chunk '1-1684329224.768845563.flb', retry in 7 seconds: task_id=1, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:46] [ info] [engine] flush chunk '1-1684329205.285653322.flb' succeeded at retry 2: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:52] [ info] [engine] flush chunk '1-1684329224.768845563.flb' succeeded at retry 1: task_id=1, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:54] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:14:35] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:14:35] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:14:35] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:14:35] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:14:35] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:14:35] [ warn] [engine] failed to flush chunk '1-1684329275.63531396.flb', retry in 8 seconds: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:14:43] [ info] [engine] flush chunk '1-1684329275.63531396.flb' succeeded at retry 1: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:20:00] [ info] [filter:kubernetes:kubernetes.0] token updated
[2023/05/17 13:29:03] [error] [tls] error: unexpected EOF
[2023/05/17 13:29:03] [error] [aws_client] connection initialization error
[2023/05/17 13:30:00] [error] [tls] error: unexpected EOF
[2023/05/17 13:30:00] [error] [aws_client] connection initialization error
[2023/05/17 13:30:01] [ info] [filter:kubernetes:kubernetes.0] token updated
Note: I have bumped the version to v2.31.10, and the same errors occur there :)
Some examples of errors from one of our own tests; these errors do happen occasionally: https://github.com/aws/aws-for-fluent-bit/pull/654
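If the broken-connection messages dominate, one knob worth experimenting with is keepalive recycling, so that Fluent Bit closes idle connections before the remote end does instead of discovering them broken on the next flush. A hedged sketch using the generic net.* output options (the values are illustrative, not tuned):

[OUTPUT]
    Name                       kinesis_firehose
    Match                      kube.*
    region                     eu-central-1
    delivery_stream            production_de_logs_stream
    Retry_Limit                3
    workers                    2
    # Recycle keepalive connections aggressively: drop them after 20s idle
    # and after 100 reuses, rather than relying on the peer's timeout.
    net.keepalive              on
    net.keepalive_idle_timeout 20
    net.keepalive_max_recycle  100

Whether this helps depends on why the peer is resetting connections, so treat it as an experiment rather than a fix.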
Did you ever figure out a solution for this?
Facing the same issue in version v2.1.6
[2024/01/24 09:36:19] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2024/01/24 09:36:19] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2024/01/24 09:36:19] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to kinesis-events
[2024/01/24 09:36:19] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2024/01/24 09:36:19] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2024/01/24 09:36:19] [error] [engine] chunk '14612-1706088928.508247200.flb' cannot be retried: task_id=0, input=tail.1 > output=kinesis_firehose.1
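That "cannot be retried" engine line means the chunk could not be retried again (typically because it hit its retry limit) and was dropped. Besides raising Retry_Limit, filesystem buffering lets Fluent Bit queue more data safely while retries are in flight and survive restarts; it does not by itself change the retry limit. A minimal sketch; the path, limits, and tail parameters here are illustrative:

[SERVICE]
    # Persist chunks to disk instead of holding them only in memory
    storage.path              /var/fluent-bit/state
    storage.sync              normal
    storage.backlog.mem_limit 5M

[INPUT]
    Name         tail
    Path         /var/log/containers/*.log    # illustrative path
    Tag          kube.*
    # Route this input's chunks through the filesystem buffer
    storage.type filesystem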
Describe the question/issue
Fluent Bit logs show frequent network connection issues to Firehose:
Occasionally we also see log loss:
We have upgraded Fluent Bit to v2.31.9 but we still see the same errors.
Configuration
Fluent Bit Log Output
Fluent Bit Version Info
v2.31.9
Cluster Details
EKS, Kubernetes version 1.21; Fluent Bit runs as a DaemonSet.
Related Issues
I saw this one, which seemed similar - https://github.com/aws/aws-for-fluent-bit/issues/354. Decided to open this bug after reading https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#how-do-i-tell-if-fluent-bit-is-losing-logs and https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues