LucasHantz opened 2 years ago
This is the current recommendation for the CloudWatch plugin config. Could you try this config?
Also, I wonder how your load testing runs: it doesn't make sense to me that these errors show up only at low throughput and not at high throughput.
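The recommended config isn't quoted in the thread; a minimal sketch of the commonly suggested CloudWatch output settings, using standard Fluent Bit options (the region, group, and stream names below are placeholders, not values from this issue):

```ini
[OUTPUT]
    Name                cloudwatch_logs
    Match               *
    # Placeholder values -- substitute your own
    region              eu-west-1
    log_group_name      my-app-logs
    log_stream_prefix   ecs-
    auto_create_group   true
    # Often suggested for connection errors: disable keepalive
    # so idle sockets are not reused after the remote side has
    # silently closed them
    net.keepalive       Off
    # Retry a failed request once on a fresh connection
    auto_retry_requests true
```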
The above graphs are from a load test we ran on our application, with the metrics generated by FireLens during that time. We saw no "[http_client] broken connection" errors during the test, but new errors appeared later that day when the cluster was idle.
From what I see in the guidance, this config helps in high-throughput cases, which is not the problem here. Should I try it anyway?
Tried with the new config, and still seeing: [error] [net] connection #51 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
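The 10-second figure in that error matches Fluent Bit's default connection timeout (`net.connect_timeout`, 10 seconds by default). One way to rule out slow TCP/TLS setup to the endpoint is to raise it per output; a sketch with illustrative placeholder values:

```ini
[OUTPUT]
    Name                cloudwatch_logs
    Match               *
    # Placeholders -- substitute your own
    region              eu-west-1
    log_group_name      my-app-logs
    # Default is 10 seconds; raise it to rule out slow
    # TCP/TLS connection establishment
    net.connect_timeout 30
```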
Any news on this, please?
@LucasHantz, may I confirm that the issue only occurs at low throughput? That is, you run Fluent Bit in the same way with the same config and see problems only at a lower ingestion rate, right? May I know what the throughput is?
The issue in fact happens at both low and high throughput. The following graph shows the number of records per minute over the last 8 hours.
As you can see, twice in the last 8 hours Fluent Bit failed and stopped reporting any new logs. This is the error log we get at that time:
[2022/06/21 14:15:31] [error] [upstream] connection #609 to firehose.eu-west-1.amazonaws.com:443 timed out after 10 seconds
[2022/06/21 14:15:31] [error] [aws_client] connection initialization error
Eventually the Fluent Bit container's memory usage exploded and forced the whole task to shut down.
Any thoughts on this? What more can I provide to help figure out this problem?
Just saw the issue raised at https://github.com/fluent/fluent-bit/issues/5705. I'm getting this error in our traces as well.
@PettitWesley maybe? Is there any way to get this prioritized? It's impacting our prod, and I don't see how to revert to a stable solution.
@LucasHantz Unfortunately right now I don't have any good ideas beyond using the settings here: https://github.com/aws/aws-for-fluent-bit/issues/340
And checking this: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues
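For anyone working through that debugging guide, a usual first step is raising the log level so connection setup, retries, and errors become visible; a minimal sketch using the standard Fluent Bit service option:

```ini
[SERVICE]
    # "debug" logs per-connection setup, retries, and errors;
    # revert to "info" once the issue is captured
    Log_Level debug
```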
Describe the question/issue
I'm seeing a broken connection error to Firehose and CloudWatch on containers with low traffic (they are in a staging environment). Once they log this error, RAM usage keeps growing until it reaches the maximum threshold and the task is killed.
Configuration
Fluent Bit Log Output
Fluent Bit Version Info
I can reproduce the same error with both the stable and latest versions of the image.
Cluster Details
ECS Fargate with awsvpc networking. The Firehose and CloudWatch VPC endpoints are enabled.
Application Details
NTR
Steps to reproduce issue
We ran load testing on the container with the same configuration without seeing this error, so it seems the error happens when throughput is low.
Related Issues
This is the new configuration I've come up with based on the recommendation given here: https://github.com/aws/aws-for-fluent-bit/issues/351
Let me know if I did something wrong.
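The actual configuration isn't shown in the thread; a sketch of what the keepalive-related tuning discussed in that issue generally looks like for a Firehose output, with placeholder values that are assumptions rather than the poster's real settings:

```ini
[OUTPUT]
    Name                       kinesis_firehose
    Match                      *
    # Placeholders -- substitute your own
    region                     eu-west-1
    delivery_stream            my-delivery-stream
    # Keep connections alive, but time out and recycle them
    # before the server side can drop an idle socket
    net.keepalive              On
    net.keepalive_idle_timeout 10
    net.keepalive_max_recycle  100
```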