aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

Crashes with "Could not find sequence token in response: response body is empty" #161

Open alexmbird opened 3 years ago

alexmbird commented 3 years ago

Hi folks,

We've been running the 2.12.0 release to ship our logs to CloudWatch with the new cloudwatch_logs plugin. We'd been waiting for the fix for renewing STS tokens, so this is our first outing with it.
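
(For anyone unfamiliar with the plugin: a minimal cloudwatch_logs output looks roughly like the sketch below. The region and group name are placeholders, not our real config.)

    [OUTPUT]
        Name                cloudwatch_logs
        Match               *
        region              eu-west-1           # placeholder region
        log_group_name      fluent-bit-logs     # placeholder log group
        auto_create_group   true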

After running reliably for several hours, several of our pods have crashed with:

[2021/03/09 22:21:54] [ info] [output:cloudwatch_logs:cloudwatch_logs.5] Created log stream fluent-bit-z64wb.application.<redacted>.log
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/09 22:22:55] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/09 22:22:56] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 22:22:56] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 22:22:56] [error] [input chunk] error writing data from tail.5 instance
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 22:22:56] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 22:22:56] [error] [input chunk] error writing data from tail.5 instance
[2021/03/09 22:23:02] [ warn] [engine] failed to flush chunk '1-1615328565.648760323.flb', retry in 8 seconds: task_id=2, input=tail.5 > output=cloudwatch_logs.5 (out_id=5)

After that it exits with an error status and Kubernetes replaces the pod.

Curiously, several replicas of fluentbit failed with the same error at once. This makes me wonder whether the CloudWatch API was briefly unavailable. But if so, I'd expect fluentbit to retry rather than take down the whole replica.
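
In the meantime we're considering raising the per-output retry limit so transient API failures are retried for longer. A sketch using Fluent Bit's standard Retry_Limit option, added to each existing [OUTPUT] block (it presumably won't help if the process itself crashes):

    [OUTPUT]
        Name                cloudwatch_logs
        Match               *
        Retry_Limit         10    # or 'False' to retry without limit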

byrneo commented 2 years ago

@PettitWesley, still no appearance of the '...sequence token in response..' errors after another 24 hours of operation.

FYI: I do still see occasional 'Failed to send log events' errors from each of the fluentbit instances:

[2021/09/15 17:04:35] [ warn] [http_client] malformed HTTP response from logs.us-east-1.amazonaws.com:443 on connection #165
[2021/09/15 17:04:35] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/09/15 17:04:35] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/09/15 17:04:35] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send events
PettitWesley commented 2 years ago

@byrneo Awesome! I will put up my patch to upstream then... we've also recently found some other upstream networking bugs that I suspect might be involved in this issue as well... so while this issue has been open for a very long time, I am finally feeling hopeful that we will soon have a resolution. It'll take a little bit of time to get these fixes merged and released upstream and incorporated into our distro.

Apologies to everyone that the resolution here has taken so long.

Funkerman1992 commented 2 years ago

@PettitWesley I've updated to the 2.19.1 image, but I still see errors and gaps in the logs of my EKS applications:

fluent-bit-9vs2j fluent-bit [2021/09/23 15:50:51] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
fluent-bit-9vs2j fluent-bit [2021/09/23 15:50:51] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
fluent-bit-9vs2j fluent-bit [2021/09/23 15:50:51] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
fluent-bit-9vs2j fluent-bit [2021/09/23 15:50:51] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send events

Here is my fluentbit config:


    [SERVICE]
        Flush                     5
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               ${HTTP_SERVER}
        HTTP_Listen               0.0.0.0
        HTTP_Port                 ${HTTP_PORT}
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

  application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  container_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              On
        Annotations         Off

    [OUTPUT]
        Name                cloudwatch
        Match               application.*
        region              ${AWS_REGION}
        default_log_group_name FluentBit-Application
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/$(kubernetes['namespace_name'])/$(kubernetes['labels']['app'])
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  dataplane-log.conf: |
    [INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      ${READ_FROM_TAIL}

    [INPUT]
        Name                tail
        Tag                 dataplane.tail.*
        Path                /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  container_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_dataplane_tail.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME                   hostname
        Rename              _SYSTEMD_UNIT               systemd_unit
        Rename              MESSAGE                     message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}-fluentbit/dataplane
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  host-log.conf: |
    [INPUT]
        Name                tail
        Tag                 host.dmesg
        Path                /var/log/dmesg
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_dmesg.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.messages
        Path                /var/log/messages
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_messages.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.secure
        Path                /var/log/secure
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_secure.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                aws
        Match               host.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               host.*
        region              ${AWS_REGION}
        log_group_name      ${CLUSTER_NAME}-fluentbit/host
        log_stream_prefix   ${HOST_NAME}.
        auto_create_group   true
        extra_user_agent    container-insights

  parsers.conf: |
    [PARSER]
        Name                docker
        Format              json
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S

    [PARSER]
        Name                container_firstline
        Format              regex
        Regex               (?<tag>[^.]+)?\.?(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                cwagent_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ
PettitWesley commented 2 years ago

@byrneo @Funkerman1992 and everyone- update on this.

So basically we have two fixes in progress:

  1. The fix I gave in the previous comment where I shared an image. That was a stop-gap measure I introduced to immediately auto-retry these invalid requests. I think it helps, but it doesn't fix the root cause.
  2. For the actual root cause we are still uncertain, but we have made some progress. We have found a number of networking issues in the core of Fluent Bit which affect all of the AWS plugins. We've seen a number of reports that I suspect might all be caused by the same set of core networking bugs. We are working on fixing those. Hopefully this will permanently and fully fix the issue.

All of these fixes will take some time to make their way upstream. Right now, everyone can use our branches and pre-release/test builds if they want.

Core Network Fix Only Build

Code is here: https://github.com/krispraws/fluent-bit/commits/v1_7_5_openssl_fix

Image is here: 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5

Pull it with:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5

Core Network Fix with Sequence Token stop gap Build

Code is here: https://github.com/PettitWesley/fluent-bit/tree/v1_7_5_openssl_fix_sequence_token

Image is here: 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap

Pull it with:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap
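
If you don't have ecs-cli installed, the standard ECR auth flow should work as well (assuming Docker and AWS CLI v2 are available):

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 144718711470.dkr.ecr.us-west-2.amazonaws.com

docker pull 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap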

Hope this helps/let me know what you see.

rpalanisamy commented 2 years ago

@PettitWesley We have hit this issue on a client project. The workaround of setting Flush to 5 and the chunk size to 1 MB worked, but it would not withstand peak production load, so we're eagerly waiting for your fix. Could you please let us know when it will be pushed to the upstream docker image (aws-for-fluent-bit)? Or did you already include it in aws-for-fluent-bit:2.20.0?
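
For anyone else hitting this, the workaround looks roughly like the sketch below. This assumes the 1 MB chunk refers to the tail input's buffer settings; adjust to your own inputs:

    [SERVICE]
        Flush               5

    [INPUT]
        Name                tail
        Path                /var/log/containers/*.log
        Buffer_Chunk_Size   1MB    # cap how much is buffered per round
        Buffer_Max_Size     1MB    # so flushed payloads stay around 1 MB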

PettitWesley commented 2 years ago

@rpalanisamy The networking fixes have been included in 2.20.0. Please try out that version.

The sequence token stop gap wasn't included in that release; it may come in a future one soon. I'm hoping that solving the networking issues will resolve this, and that the stop-gap fix won't be needed.
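
If it helps, releases should be pullable from the public ECR gallery (assuming you use that rather than Docker Hub):

docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:2.20.0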

byrneo commented 2 years ago

@PettitWesley - I'm still using your stop-gap fix 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap

Are the fixes now available in any official releases? Would you now recommend upgrading to v2.23.3?

PettitWesley commented 2 years ago

@byrneo Yea, this was contributed upstream, and it's much safer to use the newest release than my old image.
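
For anyone landing on this issue later: the stop-gap retry behaviour appears upstream as the auto_retry_requests option on the AWS output plugins. A sketch, assuming your release includes it (the region and group name are placeholders):

    [OUTPUT]
        Name                cloudwatch_logs
        Match               *
        region              us-east-1           # placeholder region
        log_group_name      my-log-group        # placeholder log group
        auto_create_group   true
        auto_retry_requests true    # immediately retry a failed request once before normal backoff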