fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

EKS 1.29 Windows node - Fluent Bit error #9409

Open Chandramouli15 opened 1 month ago

Chandramouli15 commented 1 month ago

We are getting an intermittent error in Fluent Bit. Restarting the DaemonSet resolves it temporarily, but the same error returns after 10-20 days.

While the error is occurring, application logs are not streamed to the CloudWatch log group.

We're using the AWS-managed amazon-cloudwatch-observability add-on, version v1.6.0-eksbuild.1.

Fluent Bit pod error:

[C:\build\fluent-bit\lib\chunkio\src\cio_memfs.c:50 errno=12] Not enough space
[2024/07/25 12:15:10] [error] [input chunk] could not create chunk file: tail.1:6716-1721909710.306189500.flb
[2024/07/25 12:15:10] [error] [input chunk] no available chunk
[C:\build\fluent-bit\lib\chunkio\src\cio_memfs.c:50 errno=12] Not enough space
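
The `errno=12` in the chunkio message can be decoded to see what kind of failure it reports. Below is a minimal Python sketch, assuming the log line format shown above; the regex and the `parse_chunkio_error` helper are illustrative and not part of Fluent Bit. Errno 12 is `ENOMEM`, and the Microsoft C runtime renders it as "Not enough space", so the message suggests a failed memory allocation rather than a full disk.

```python
import errno
import re

# Matches chunkio-style messages such as:
#   [C:\path\cio_memfs.c:50 errno=12] Not enough space
# The pattern is an assumption based on the log lines in this issue.
CIO_ERROR = re.compile(
    r"\[(?P<file>[^\]]+?):(?P<line>\d+) errno=(?P<errno>\d+)\]\s*(?P<msg>.*)"
)

def parse_chunkio_error(line):
    """Return (source_file, line_no, errno_name, message), or None if no match."""
    m = CIO_ERROR.search(line)
    if m is None:
        return None
    code = int(m.group("errno"))
    # Map the numeric code to its symbolic name, e.g. 12 -> "ENOMEM".
    name = errno.errorcode.get(code, "UNKNOWN")
    return m.group("file"), int(m.group("line")), name, m.group("msg").strip()

sample = r"[C:\build\fluent-bit\lib\chunkio\src\cio_memfs.c:50 errno=12] Not enough space"
print(parse_chunkio_error(sample))
```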

I increased the buffer limit to 2.5 GB, but the error persists.

apiVersion: v1
data:
  application-log.conf: |
[INPUT]
    Name                tail
    Tag                 application.*
    Exclude_Path        C:\\var\\log\\containers\\fluent-bit*, C:\\var\\log\\containers\\cloudwatch-agent*
    Path                C:\\var\\log\\containers\\*.log
    Parser              docker
    DB                  C:\\var\\fluent-bit\\state\\flb_container.db
    Mem_Buf_Limit       2500MB
    Skip_Long_Lines     On
    Rotate_Wait         30
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}

[INPUT]
    Name                tail
    Tag                 application.*
    Path                C:\\var\\log\\containers\\fluent-bit*
    Parser              docker
    DB                  C:\\var\\fluent-bit\\state\\flb_log.db
    Mem_Buf_Limit       2500MB
    Skip_Long_Lines     On
    Rotate_Wait         30
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}

[INPUT]
    Name                tail
    Tag                 application.*
    Path                C:\\var\\log\\containers\\cloudwatch-agent*
    Parser              docker
    DB                  C:\\var\\fluent-bit\\state\\flb_cwagent.db
    Mem_Buf_Limit       2500MB
    Skip_Long_Lines     On
    Rotate_Wait         30
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}

[OUTPUT]
    Name                cloudwatch_logs
    Match               application.*
    region              ${AWS_REGION}
    log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
    log_stream_prefix   ${HOST_NAME}-
    auto_create_group   true
    extra_user_agent    container-insights

  dataplane-log.conf: |
[INPUT]
    Name                tail
    Tag                 dataplane.tail.*
    Path                C:\\ProgramData\\containerd\\root\\*.log, C:\\ProgramData\\Amazon\\EKS\\logs\\*.log
    Parser              dataplane_firstline
    DB                  C:\\var\\fluent-bit\\state\\flb_dataplane_tail.db
    Mem_Buf_Limit       2500MB
    Skip_Long_Lines     On
    Rotate_Wait         30
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}

[INPUT]
    Name                tail
    Tag                 dataplane.tail.C.ProgramData.Amazon.EKS.logs.vpc-bridge
    Path                C:\\ProgramData\\Amazon\\EKS\\logs\\*.log.*
    Path_Key            file_name
    Parser              dataplane_firstline
    DB                  C:\\var\\fluent-bit\\state\\flb_dataplane_cni_tail.db
    Mem_Buf_Limit       2500MB
    Skip_Long_Lines     On
    Rotate_Wait         30
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}

[FILTER]
    Name                aws
    Match               dataplane.*
    imds_version        v2

[OUTPUT]
    Name                cloudwatch_logs
    Match               dataplane.*
    region              ${AWS_REGION}
    log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
    log_stream_prefix   ${HOST_NAME}-
    auto_create_group   true
    extra_user_agent    container-insights

  fluent-bit.conf: |
[SERVICE]
    Flush               5
    Log_Level           error
    Daemon              off
    net.dns.resolver    LEGACY
    Parsers_File        parsers.conf

@INCLUDE application-log.conf
@INCLUDE dataplane-log.conf
@INCLUDE host-log.conf

  host-log.conf: |
[INPUT]
    Name                winlog
    Channels            EKS, System
    DB                  C:\\var\\fluent-bit\\state\\flb_system_winlog.db
    Interval_Sec        60

[FILTER]
    Name                aws
    Match               winlog.*
    imds_version        v2

[OUTPUT]
    Name                cloudwatch_logs
    Match               winlog.*
    region              ${AWS_REGION}
    log_group_name      /aws/containerinsights/${CLUSTER_NAME}/host
    log_stream_prefix   ${HOST_NAME}.
    auto_create_group   true
    extra_user_agent    container-insights

  parsers.conf: |
[PARSER]
    Name                docker
    Format              json
    Time_Key            time
    Time_Format         %b %d %H:%M:%S

[PARSER]
    Name                container_firstline
    Format              regex
    Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
    Time_Key            time
    Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

[PARSER]
    Name                dataplane_firstline
    Format              regex
    Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
    Time_Key            time
    Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

kind: ConfigMap

Note: I only modified the buffer limit in the default ConfigMap.
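
Because `Mem_Buf_Limit` applies to each input instance separately, raising it on every `tail` input also raises the combined ceiling. The Python sketch below adds up the limits across the `[INPUT]` sections of a classic-mode config, so the total can be compared with the memory actually available to the pod. The parser is deliberately simple and hypothetical (`total_mem_buf_limit` is not a Fluent Bit tool), and it assumes decimal `K`/`M`/`G` suffixes with an optional trailing `B`.

```python
import re

# Decimal multipliers, following the Fluent Bit "unit sizes" documentation.
# Treat this as an assumption and adjust if your build behaves differently.
UNITS = {"": 1, "K": 1000, "M": 1000**2, "G": 1000**3}

def size_to_bytes(text):
    """Convert strings like '2500MB', '50m' or '1024' to a byte count."""
    m = re.fullmatch(r"(\d+)\s*([KMG]?)B?", text.strip().upper())
    if m is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(m.group(1)) * UNITS[m.group(2)]

def total_mem_buf_limit(config_text):
    """Sum Mem_Buf_Limit over all [INPUT] sections of a classic-mode config."""
    total = 0
    section = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].upper()  # entering a new section
            continue
        parts = line.split(None, 1)  # key, then the rest of the line as value
        if section == "INPUT" and len(parts) == 2 and parts[0].lower() == "mem_buf_limit":
            total += size_to_bytes(parts[1])
    return total

# Shortened example with two inputs; paste the real config text instead.
example = """
[INPUT]
    Name            tail
    Mem_Buf_Limit   2500MB
[INPUT]
    Name            tail
    Mem_Buf_Limit   2500MB
[OUTPUT]
    Name            cloudwatch_logs
"""
print(total_mem_buf_limit(example) / 1000**3, "GB")
```

In the ConfigMap above, five `tail` inputs each set 2500MB, which allows up to 12.5 GB of buffered chunks in total. That may be more than the container can allocate, although this thread does not establish that as the cause.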

patrick-stephens commented 1 month ago

Please provide the full details requested in the issue template, i.e. which version of Fluent Bit you are running, and if it is not the latest, please try that. Could you also fix the formatting of the text? It's a little confusing.

What are the metrics like for input vs. output rates? Is there a spike when this happens, or is it slowly filling up?
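
One way to gather these input and output rates is Fluent Bit's built-in HTTP monitoring endpoint. It is not enabled in the config above and would need `HTTP_Server On` in `[SERVICE]`. A rough Python sketch follows, assuming the JSON shape returned by `/api/v1/metrics` (per-plugin `records` for inputs and `proc_records` for outputs) and the default port 2020; the `rates` helper and the sample snapshots are made up for illustration.

```python
import json
import urllib.request

def fetch_metrics(url="http://127.0.0.1:2020/api/v1/metrics"):
    """Fetch one metrics snapshot; the URL assumes the default HTTP_Port 2020."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def rates(before, after, seconds):
    """Records per second entering inputs and leaving outputs between two snapshots."""
    def total(snapshot, group, field):
        # Sum one counter across every plugin instance in the group.
        return sum(p.get(field, 0) for p in snapshot.get(group, {}).values())

    rec_in = total(after, "input", "records") - total(before, "input", "records")
    rec_out = total(after, "output", "proc_records") - total(before, "output", "proc_records")
    return rec_in / seconds, rec_out / seconds

# Made-up snapshots taken 60 seconds apart, for illustration only.
t0 = {"input": {"tail.0": {"records": 1000}},
      "output": {"cloudwatch_logs.0": {"proc_records": 900}}}
t1 = {"input": {"tail.0": {"records": 7000}},
      "output": {"cloudwatch_logs.0": {"proc_records": 3900}}}

in_rate, out_rate = rates(t0, t1, 60)
print(f"in={in_rate:.1f}/s out={out_rate:.1f}/s")
```

If the input rate stays above the output rate, chunks accumulate in memory until the buffer limit is reached. Sampling over several days would show whether that happens as a spike or as a gradual fill.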

Chandramouli15 commented 1 month ago

Version details Fluentbit version

Error

image

We're using a VPC endpoint for connectivity. I checked connectivity, and the nodes can reach the endpoint. Could the TLS error be related to cert-manager?

patrick-stephens commented 1 month ago

The images give me "private" errors. Please just use text and follow the template for all the details.