Open faryeyay opened 1 year ago
Hi, when you mention that Fluent Bit is ending up in a zombie state, do you mean that STS is no longer retrying and that Fluent Bit is hanging, or rather that STS keeps getting throttled?
We've had a lot of bug fixes since 2.28.4 which was released last year, so please upgrade: https://github.com/aws/aws-for-fluent-bit/releases
To be clear, I'm not sure this will fix the issue; please upgrade in general to get the latest patches.
In terms of our implementation:
I do not think Fluent Bit code could cause throttling from STS. I think this has something to do with your setup or deployment.
We recommend reaching out to the STS team via AWS Support to see if they can determine from your call patterns why you made so many calls to STS.
Describe the question/issue
We're seeing an issue where our FluentBit containers fail to authenticate with STS because of throttling. During periods of heavy workload, a large number of clients attempt to authenticate with STS at once, and some of the attempts fail. The workloads are services and jobs that we run in Kubernetes on EKS. A number of our FluentBit containers also fail to authenticate. These containers end up in a zombie state where none of the logs on that node are sent to Kinesis Firehose.
Configuration
Here's our FluentBit configuration:
In addition, here is the configuration that we're using for the container:
Fluent Bit Log Output
Here's the log output using the debug image:
You can see the message that says "Rate exceeded".
Fluent Bit Version Info
We're running version 2.28.4. Specifically, here's the image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.28.4
Cluster Details
We're running AWS for FluentBit on Kubernetes. We're using EKS. The version is 1.23. In addition, we're using a custom AMI based on the Amazon Linux 2 EKS AMI. We don't currently use VPC endpoints for Kinesis Firehose. We're running FluentBit in a daemonset.
Application Details
N/A
Steps to reproduce issue
This issue is difficult to reproduce. You need to have a lot of clients hitting STS at once.
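To illustrate the pattern without touching real STS, here is a rough local sketch. It is a toy token-bucket simulation, not how STS actually enforces its limits. The class `TokenBucket`, the function `simulate`, and the numbers (10 requests/sec refill, burst capacity of 20) are all made up for illustration. It only shows why the same number of clients gets throttled when they arrive in a short burst but not when they are spread out.

```python
from typing import Tuple


class TokenBucket:
    """Toy rate limiter. Rate and capacity are hypothetical, not real STS limits."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # simulated time of the previous request

    def try_acquire(self, now: float) -> bool:
        # Refill according to simulated elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # stands in for a "Rate exceeded" response


def simulate(num_clients: int, window_sec: float) -> Tuple[int, int]:
    """Return (accepted, throttled) for clients arriving evenly within window_sec."""
    bucket = TokenBucket(rate_per_sec=10.0, capacity=20.0)
    accepted = throttled = 0
    for i in range(num_clients):
        arrival = i * window_sec / num_clients  # evenly spaced simulated arrivals
        if bucket.try_acquire(arrival):
            accepted += 1
        else:
            throttled += 1
    return accepted, throttled


if __name__ == "__main__":
    # Same number of clients, very different arrival windows.
    print("burst :", simulate(num_clients=200, window_sec=1.0))
    print("spread:", simulate(num_clients=200, window_sec=60.0))
```

In this toy model, the burst case rejects most requests while the spread case rejects none, which matches what we see: failures only show up when many clients authenticate at the same moment.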
Related Issues
N/A