aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

aws-for-fluent-bit: Container Fails to Send Logs After STS Failure Due to Throttling #711

Open faryeyay opened 1 year ago

faryeyay commented 1 year ago

Describe the question/issue

We're seeing an issue where our Fluent Bit containers fail to authenticate with STS because of throttling. During periods of heavy load, a large number of clients (services and jobs we run in Kubernetes on EKS) attempt to authenticate with STS at the same time, and some of those attempts fail. A number of our Fluent Bit containers also fail to authenticate, and those containers end up in a zombie state where none of the logs on that node are sent to Kinesis Firehose.

Configuration

Here's our FluentBit configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: fluentbit
  labels:
    k8s-app: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     debug
        Daemon        Off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input.conf
    @INCLUDE filter.conf
    @INCLUDE output.conf

  input.conf: |
    # Alias format
    # <plugin_type>__<tag_name>(__<extra_info>: OPTIONAL)
    [INPUT]
        Name              tail
        Tag               logging_service
        Alias             tail__logging_service
        Path              /var/log/containers/*.log
        DB                /var/log/k8s_logging_service.db
        Log_Level         info
        Mem_Buf_Limit     20MB
        Buffer_Max_Size   1MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        Docker_Mode       On
        Parser            docker

  filter.conf: |

    [FILTER]
        Name          parser
        Match         logging_service
        Alias         parser__logging_service__observability_logging_parser
        Key_Name      log
        Parser        observability_logging_parser

    [FILTER]
        Name          grep
        Match         logging_service
        Alias         grep__logging_service__health_checks
        exclude       $body['message'] .*?\/admin\/healthcheck.*?200.*?

  parsers.conf: |
    [PARSER]
        Name        observability_logging_parser
        Format      json
        Time_Key    timestamp
        Time_Format %Y-%m-%d %H:%M:%S,%L
        Time_Keep   On

    [PARSER]
        Name         docker
        Format       json
        Time_Key     time
        Time_Format  %Y-%m-%dT%H:%M:%S.%L
        Time_Keep    On

  output.conf: |
    [OUTPUT]
        Name            kinesis_firehose
        Match           logging_service
        Alias           kinesis_firehose__service__logging_service
        region          us-east-1
        delivery_stream logging_service
        Retry_Limit     10

In addition, here is the configuration that we're using for the container:

        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      serviceAccountName: fluent-bit
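
For reference, the fluent-bit service account is where the pods get their AWS credentials. Assuming IRSA (IAM Roles for Service Accounts), which would match the STS assume-role calls in the logs below, the service account looks roughly like this; the role ARN is a placeholder:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: fluentbit
  labels:
    k8s-app: fluent-bit
  annotations:
    # Placeholder ARN; the actual role grants firehose:PutRecordBatch on the delivery stream
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/fluent-bit-firehose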

Fluent Bit Log Output

[2023/06/13 19:48:12] [error] [aws_client] auth error, refreshing creds
[2023/06/13 19:48:12] [error] [aws_credentials] Shared credentials file /root/.aws/credentials does not exist
[2023/06/13 19:48:12] [error] [aws_credentials] STS assume role request failed

Here's the log output using the debug image:

"[2023/07/28 21:48:19] [debug] [http_client] server sts.us-east-1.amazonaws.com:443 will close connection #90",
"[2023/07/28 21:48:19] [debug] [aws_client] sts.us-east-1.amazonaws.com: http_do=0, HTTP Status: 400",
"[2023/07/30 04:36:43] [debug] [aws_client] Unable to parse API response- response is not valid JSON.",
"[2023/07/30 04:36:43] [debug] [aws_credentials] STS raw response: ",
"<ErrorResponse xmlns=\"https://sts.amazonaws.com/doc/2011-06-15/\">",
"  <Error>",
"    <Type>Sender</Type>",
"    <Code>Throttling</Code>",
"    <Message>Rate exceeded</Message>",
"  </Error>",
"  <RequestId>be478741-1034-4953-93be-fd2994753758</RequestId>",
"</ErrorResponse>",
"[2023/07/30 04:36:43] [error] [aws_credentials] STS assume role request failed",
"[2023/07/30 04:36:43] [debug] [socket] could not validate socket status for #95 (don't worry)",
"[2023/07/30 04:36:43] [debug] [http_client] server firehose.us-east-1.amazonaws.com:443 will close connection #92",
"[2023/07/30 04:36:43] [debug] [aws_client] firehose.us-east-1.amazonaws.com: http_do=0, HTTP Status: 400",
"[2023/07/30 04:36:43] [error] [aws_client] auth error, refreshing creds",

You can see the message that says Rate exceeded.

Fluent Bit Version Info

We're running version 2.28.4. Specifically, here's the image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.28.4

Cluster Details

We're running AWS for Fluent Bit as a daemonset on EKS 1.23, on a custom AMI based on the Amazon Linux 2 EKS AMI. We don't currently use VPC endpoints for Kinesis Firehose.

Application Details

N/A

Steps to reproduce issue

This issue is difficult to reproduce. You need to have a lot of clients hitting STS at once.

Related Issues

N/A

matthewfala commented 1 year ago

Hi, when you mention that Fluent Bit ends up in a zombie state, do you mean that the STS call is no longer retried and Fluent Bit hangs, or that the STS calls keep getting throttled?

PettitWesley commented 1 year ago

We've had a lot of bug fixes since 2.28.4, which was released last year, so please upgrade: https://github.com/aws/aws-for-fluent-bit/releases

To be clear, I'm not sure this will fix the issue; upgrading in general just gets you the latest patches.
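
For example, pinning the daemonset container to a newer image would look roughly like this (the stable tag and container name here are illustrative; pick an actual release from the link above and validate it):

      containers:
      - name: fluent-bit
        # Illustrative tag; substitute a specific release you have tested
        image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable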

PettitWesley commented 1 year ago

In terms of our implementation:

I do not think Fluent Bit code could cause throttling from STS. I think this has something to do with your setup or deployment.

We recommend reaching out to the STS team via AWS Support to see whether they can analyze your call patterns and determine why so many calls are being made to STS.
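
In the meantime, to keep a throttled pod from sitting in that zombie state indefinitely, you could enable Fluent Bit's health check endpoint and point a liveness probe at it so Kubernetes restarts a pod whose errors keep piling up. This is a rough sketch; the thresholds are examples you would want to tune. In the [SERVICE] section (HTTP_Server is already on in your config):

        Health_Check            On
        HC_Errors_Count         5
        HC_Retry_Failure_Count  5
        HC_Period               60

And on the daemonset container:

        livenessProbe:
          httpGet:
            path: /api/v1/health
            port: 2020
          initialDelaySeconds: 30
          periodSeconds: 60
          failureThreshold: 2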