aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

Cannot reproduce issue - Fluent Bit stops sending logs to S3 #718

Open danielejuan-metr opened 1 year ago

danielejuan-metr commented 1 year ago

Describe the question/issue

May we request assistance with an incident we encountered in our development environment. After running Fluent Bit in EKS for some time, it suddenly stopped sending files to S3 for tags that match application.*, as described in the config below. There were no updates or other changes to the Fluent Bit deployment.

We expected to see 'Successfully uploaded object' in the Fluent Bit logs for the application logs; however, we only saw successful uploads for the dataplane and host logs.

We were unable to find similar issues in this repository; we hope you can advise on steps to possibly reproduce the issue.

Configuration

  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Grace                     60
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               ${HTTP_SERVER}
        HTTP_Listen               0.0.0.0
        HTTP_Port                 ${HTTP_PORT}
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M
        storage.metrics           on

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

  application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  container_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        Read_from_Head      ${READ_FROM_HEAD}
        storage.type        filesystem

    [OUTPUT]
        Name                      s3
        Match                     application.*
        region                    ${AWS_REGION}
        bucket                    ${S3_BUCKET_NAME}
        total_file_size           256M
        upload_timeout            5m
        compression               gzip
        use_put_object            On
        s3_key_format             ${S3_OBJECT_PREFIX}logs-apps/%Y/%m/%d/%Y%m%d%H%M%S-$TAG-$UUID.log.gz
        store_dir                 /var/fluent-bit/tmp/s3
        preserve_data_ordering    On
        workers                   1

    [FILTER]
        Name                      rewrite_tag
        Match                     application.*
        Rule                      $log "prefix-to-find" structured-logs true
        Emitter_Storage.type      filesystem
        Emitter_Name              structured-logs

    [OUTPUT]
        Name                      splunk
        Match                     structured-logs
        Host                      ${SPLUNK_HOST}
        Port                      ${SPLUNK_PORT}
        #TLS                       On
        #TLS.Verify                Off
        splunk_token              ${SPLUNK_TOKEN}
        storage.total_limit_size  10G

  dataplane-log.conf: |
    [INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      ${READ_FROM_TAIL}

    [INPUT]
        Name                tail
        Tag                 dataplane.tail.*
        Path                /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  container_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_dataplane_tail.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME                   hostname
        Rename              _SYSTEMD_UNIT               systemd_unit
        Rename              MESSAGE                     message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v1

    [OUTPUT]
        Name                      s3
        Match                     dataplane.*
        region                    ${AWS_REGION}
        bucket                    ${S3_BUCKET_NAME}
        total_file_size           256M
        upload_timeout            10m
        s3_key_format             ${S3_OBJECT_PREFIX}logs-dataplane/%Y/%m/%d/%Y%m%d%H%M%S-$TAG-$UUID.log.gz

  host-log.conf: |
    [INPUT]
        Name                tail
        Tag                 host.dmesg
        Path                /var/log/dmesg
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_dmesg.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.messages
        Path                /var/log/messages
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_messages.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.secure
        Path                /var/log/secure
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_secure.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                aws
        Match               host.*
        imds_version        v1

    [OUTPUT]
        Name                      s3
        Match                     host.*
        region                    ${AWS_REGION}
        bucket                    ${S3_BUCKET_NAME}
        total_file_size           256M
        upload_timeout            10m
        s3_key_format             ${S3_OBJECT_PREFIX}logs-host/%Y/%m/%d/%Y%m%d%H%M%S-$TAG-$UUID.log.gz

  parsers.conf: |
    [PARSER]
        Name                docker
        Format              json
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S

    [PARSER]
        Name                container_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                cwagent_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

Fluent Bit Log Output

During the issue, we expected to see "Successfully uploaded object" in the Fluent Bit logs, but it never appeared. Our log level was set to info, so we raised it to debug to investigate. After the restart the issue no longer occurred and Fluent Bit started sending the backlog.

We did not see any error logs from Fluent Bit during the issue.

Logs after restarting: (screenshot not included)

Metrics during the issue

{
    "input_chunks": {
        "storage_backlog.6": {
            "chunks": {
                "busy": 0,
                "busy_size": "0b",
                "down": 0,
                "total": 0,
                "up": 0
            },
            "status": {
                "mem_limit": "0b",
                "mem_size": "0b",
                "overlimit": false
            }
        },
        "structured-logs": {
            "chunks": {
                "busy": 0,
                "busy_size": "0b",
                "down": 5,
                "total": 5,
                "up": 0
            },
            "status": {
                "mem_limit": "9.5M",
                "mem_size": "0b",
                "overlimit": false
            }
        },
        "systemd.1": {
            "chunks": {
                "busy": 1,
                "busy_size": "267b",
                "down": 0,
                "total": 1,
                "up": 1
            },
            "status": {
                "mem_limit": "0b",
                "mem_size": "267b",
                "overlimit": false
            }
        },
        "tail.0": {
            "chunks": {
                "busy": 128,
                "busy_size": "497.0K", <--- why stuck in busy?
                "down": 27539,
                "total": 27667,
                "up": 128
            },
            "status": {
                "mem_limit": "47.7M",
                "mem_size": "497.0K",
                "overlimit": false
            }
        },
        "tail.2": {
            "chunks": {
                "busy": 0,
                "busy_size": "0b",
                "down": 0,
                "total": 0,
                "up": 0
            },
            "status": {
                "mem_limit": "47.7M",
                "mem_size": "0b",
                "overlimit": false
            }
        },
        "tail.3": {
            "chunks": {
                "busy": 0,
                "busy_size": "0b",
                "down": 0,
                "total": 0,
                "up": 0
            },
            "status": {
                "mem_limit": "4.8M",
                "mem_size": "0b",
                "overlimit": false
            }
        },
        "tail.4": {
            "chunks": {
                "busy": 1,
                "busy_size": "248b",
                "down": 0,
                "total": 1,
                "up": 1
            },
            "status": {
                "mem_limit": "4.8M",
                "mem_size": "248b",
                "overlimit": false
            }
        },
        "tail.5": {
            "chunks": {
                "busy": 0,
                "busy_size": "0b",
                "down": 0,
                "total": 0,
                "up": 0
            },
            "status": {
                "mem_limit": "4.8M",
                "mem_size": "0b",
                "overlimit": false
            }
        }
    },
    "storage_layer": {
        "chunks": {
            "fs_chunks": 27672,
            "fs_chunks_down": 27544, <- should be low
            "fs_chunks_up": 128,
            "mem_chunks": 2,
            "total_chunks": 27674
        }
    }
}

{
    "filter": {
        "aws.2": {
            "add_records": 0,
            "drop_records": 1175
        },
        "aws.3": {
            "add_records": 0,
            "drop_records": 0
        },
        "modify.1": {
            "add_records": 0,
            "drop_records": 0
        },
        "rewrite_tag.0": {
            "add_records": 0,
            "drop_records": 2609,
            "emit_records": 35261
        }
    },
    "input": {
        "storage_backlog.6": {
            "bytes": 0,
            "records": 0
        },
        "structured-logs": {
            "bytes": 9707328,
            "records": 35261
        },
        "systemd.1": {
            "bytes": 540001218,
            "records": 416905
        },
        "tail.0": {
            "bytes": 57972115723,
            "files_closed": 9518,
            "files_opened": 9546,
            "files_rotated": 8250,
            "records": 173106822
        },
        "tail.2": {
            "bytes": 14306143,
            "files_closed": 1,
            "files_opened": 5,
            "files_rotated": 1,
            "records": 63988
        },
        "tail.3": {
            "bytes": 50478,
            "files_closed": 0,
            "files_opened": 1,
            "files_rotated": 0,
            "records": 589
        },
        "tail.4": {
            "bytes": 90990811,
            "files_closed": 4,
            "files_opened": 5,
            "files_rotated": 4,
            "records": 455474
        },
        "tail.5": {
            "bytes": 44286,
            "files_closed": 4,
            "files_opened": 5,
            "files_rotated": 4,
            "records": 332
        }
    },
    "output": {
        "s3.0": {
            "dropped_records": 779,
            "errors": 72,
            "proc_bytes": 21267798,
            "proc_records": 113726,
            "retried_records": 0,
            "retries": 0,
            "retries_failed": 0
        },
        "s3.2": {
            "dropped_records": 291,
            "errors": 0,
            "proc_bytes": 134597248,
            "proc_records": 479426,
            "retried_records": 539,
            "retries": 355,
            "retries_failed": 201
        },
        "s3.3": {
            "dropped_records": 158,
            "errors": 0,
            "proc_bytes": 116137579,
            "proc_records": 456236,
            "retried_records": 380,
            "retries": 199,
            "retries_failed": 100
        },
        "splunk.1": {
            "dropped_records": 0,
            "errors": 0,
            "proc_bytes": 0,
            "proc_records": 0,
            "retried_records": 0,
            "retries": 0,
            "retries_failed": 0
        }
    }
}
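For reference, metrics dumps like the ones above can be collected from Fluent Bit's built-in HTTP server while the issue is happening (a minimal sketch, assuming the HTTP server is enabled as in the config above and ${HTTP_PORT} resolves to the default 2020; jq is only used for readability):

# Storage/chunk state per input (requires storage.metrics on)
curl -s http://127.0.0.1:2020/api/v1/storage | jq .
# Per-plugin record/byte counters, output errors and retries
curl -s http://127.0.0.1:2020/api/v1/metrics | jq .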

We also noticed that the storage utilization of Fluent Bit is high:

# Result of du -s -m /var/fluent-bit/* in fluentbit pod
55428   /var/fluent-bit/state
27      /var/fluent-bit/tmp
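To see where that space accumulates, the buffer directories from the configuration above can be inspected directly (a sketch, assuming the storage.path and store_dir values shown in the config; paths will differ in other setups):

# Filesystem buffer used by inputs with storage.type filesystem
du -sh /var/fluent-bit/state/flb-storage/*
# Rough count of buffered chunk files waiting to be flushed
find /var/fluent-bit/state/flb-storage/ -type f | wc -l
# Local staging area used by the s3 output (store_dir)
du -sh /var/fluent-bit/tmp/s3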

Fluent Bit Version Info

Cluster Details

Application Details

Very low traffic since the issue was encountered in a dev environment.

Steps to reproduce issue

We are unable to reproduce the issue.


EDIT:

Additional info: AWS EKS version is 1.22, so it is still using dockershim.

zwj102030 commented 1 year ago

Hey, could you go over the debugging guide to see if it helps?

PettitWesley commented 1 year ago

For info on metrics emitted by Fluent Bit: https://docs.fluentbit.io/manual/administration/monitoring#metric-descriptions

PettitWesley commented 1 year ago

Additional info: AWS EKS version is 1.22, so it is still using dockershim.

With Docker, k8s uses this log driver: https://docs.docker.com/config/containers/logging/json-file/

It has max-file and max-size settings; a sample daemon.json is sketched after the list below. It is possible to lose logs because:

  1. Disk space on your nodes is saturated and log files can't be written
  2. The log files fill up, rotate, and get deleted (controlled by max-file and max-size) faster than Fluent Bit can read them. If this happens, you should still see some logs collected for the pods, but some will be lost.
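For illustration, the json-file driver's rotation limits are typically configured in /etc/docker/daemon.json on each node. The snippet below is a hedged example with illustrative values, not the actual settings of this cluster's EKS nodes:

# Illustrative /etc/docker/daemon.json (example values only, not taken from this cluster)
{
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "10m",
        "max-file": "3"
    }
}

With values like these, Docker keeps at most 3 files of 10 MB per container; once a file rotates out, anything Fluent Bit has not yet read from it is gone.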
danielejuan-metr commented 1 year ago

Hi @PettitWesley, thanks for the response. We have considered the scenarios in your list; hoping you can help us identify whether we have incorrect expectations or assumptions.

For scenario 1, we use filesystem buffering for our tail input. Assuming no new logs are written, we expect Fluent Bit's internal buffers to drain to empty if the pipeline is sending logs out properly. However, we see a large tail buffer directory on disk and a high number of down chunks in the metrics. We cannot identify why the buffer was not being consumed and the chunks sent to S3.

For scenario 2, we would expect at least some logs to be uploaded to S3 if Fluent Bit could not keep up with the incoming volume. However, during the issue no logs were uploaded to S3 at all.
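One way to verify whether the filesystem buffer is draining is to poll the storage endpoint and watch the down-chunk count over time (a minimal sketch, assuming the HTTP server is reachable on port 2020 inside the pod and jq is available; with no new logs being written, fs_chunks_down should trend toward zero):

# Poll the storage layer summary every 30 seconds
while true; do
    curl -s http://127.0.0.1:2020/api/v1/storage | jq '.storage_layer.chunks'
    sleep 30
done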

PettitWesley commented 1 year ago

Please see our log loss investigation runbook: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#log-loss-investigation-runbook

danielejuan-metr commented 1 year ago

Hi @PettitWesley, with the same environment and configuration above, we are now testing with a high log load.

At 8 MB/s of application log throughput, we encounter the following Fluent Bit errors and find missing log entries in S3:

[2023/08/29 16:53:36] [error] [tls] error: error:00000005:lib(0):func(0):DH lib
[2023/08/29 16:53:36] [error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device
[2023/08/29 16:53:36] [error] [plugins/in_tail/tail_file.c:1432 errno=2] No such file or directory
[2023/08/29 16:53:36] [error] [plugins/in_tail/tail_fs_inotify.c:147 errno=2] No such file or directory
[2023/08/29 16:53:36] [error] [input:tail:tail.0] inode=16798413 cannot register file /var/log/containers/container-5fcd9d46b5-vgsk6_container-9999550bf6611e4082208c7eaee6b4d8f1784316b6861e2afe06d224b330341f.log

During this test we are also seeing connection errors to Splunk.

Can this be a symptom of scenario 2?

We have seen similar logs in this ticket: https://github.com/fluent/fluent-bit/issues/3039. Could you verify whether the latest stable aws-for-fluent-bit release includes the patch from that ticket?

We were unable to find documentation about the "cannot register file" error in https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md.

PettitWesley commented 1 year ago

I'm not sure to be honest, sorry. The first two errors are very common and not necessarily something to worry about: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#common-network-errors

brianluong commented 10 months ago

We're seeing something similar. Randomly, our Fluent Bit container gets "stuck". In addition to not routing logs to the destination (S3), the Fluent Bit container's own logs also stop being emitted. It's as if the process dies or hangs on something.

No noticeable spikes in resources. The container didn't OOM.

Logs:

2024-01-03 23:12:49 UTC | TRACE | INFO | (pkg/trace/info/stats.go:91 in LogAndResetStats) | No data received
[2024/01/03 23:12:55] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..

Any pointers?