fluentbit missing logs in aws cloudwatch

mukshe01 commented 7 months ago

Hi Team,

We are running fluentbit to push application logs from our Kubernetes cluster (an EKS cluster with EC2 machines as k8s nodes) to CloudWatch. Recently we observed that some log entries are missing in CloudWatch when the system is under high load.

Below is our fluentbit config:

fluent-bit.conf:

[SERVICE]
    HTTP_Server             On
    HTTP_Listen             0.0.0.0
    HTTP_PORT               2020
    Health_Check            On
    HC_Errors_Count         5
    HC_Retry_Failure_Count  5
    HC_Period               5
    Parsers_File            /fluent-bit/parsers/parsers.conf

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Docker_Mode       On
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10

[FILTER]
    Name                 kubernetes
    Match                kube.*
    Kube_URL             https://kubernetes.default.svc.cluster.local:443
    Merge_Log            On
    Merge_Log_Key        data
    Keep_Log             On
    K8S-Logging.Parser   On
    K8S-Logging.Exclude  On
    Buffer_Size          2048k

[OUTPUT]
    Name                 cloudwatch_logs
    Match                *
    region               us-east-1
    log_group_name       /aws/containerinsights/one-source-qa-n5p1P1d1/application-new
    log_stream_prefix    fluentbit-
    log_stream_template  $kubernetes['namespace_name'].$kubernetes['container_name']
    auto_create_group    true

We installed fluentbit in our k8s cluster using this Helm chart: https://github.com/aws/eks-charts/tree/master/stable/aws-for-fluent-bit

fluentbit appVersion: 2.31.11
helm chart version: 0.1.28

We are seeing two types of errors in the fluentbit log:

    2024-03-19T11:32:29.235887301Z stderr F [2024/03/19 11:32:29] [ info] [input:tail:tail.0] inode=26222862 handle rotation(): /var/log/containers/rest-api-qa-954d864f9-smkv5_participant1-qa_rest-api-c5dac2e011fe0f093560b815135fff49dfade0835e22fd71c88aed4fa4d86439.log => /var/log/pods/participant1-qa_rest-api-qa-954d864f9-smkv5_319b4e14-e50c-44c6-86ff-558547bbcb3c/rest-api/0.log.20240319-113228
    2024-03-19T11:32:29.488386964Z stderr F [2024/03/19 11:32:29] [ info] [input] tail.0 resume (mem buf overlimit)
    2024-03-19T11:32:49.909327531Z stderr F [2024/03/19 11:32:49] [ info] [input] tail.0 resume (mem buf overlimit)
    2024-03-19T11:32:49.911154349Z stderr F [2024/03/19 11:32:49] [error] [plugins/in_tail/tail_file.c:1432 errno=2] No such file or directory
    2024-03-19T11:32:49.911160979Z stderr F [2024/03/19 11:32:49] [error] [plugins/in_tail/tail_fs_inotify.c:147 errno=2] No such file or directory
    2024-03-19T11:32:49.911163819Z stderr F [2024/03/19 11:32:49] [error] [input:tail:tail.0] inode=26222863 cannot register file /var/log/containers/rest-api-qa-954d864f9-smkv5_participant1-qa_rest-api-c5dac2e011fe0f093560b815135fff49dfade0835e22fd71c88aed4fa4d86439.log

We also see many occurrences of the following when the system is under high load (our mem buffer config is Mem_Buf_Limit 5MB):

    2024-03-20T13:29:12.624465969Z stderr F [2024/03/20 13:29:12] [ warn] [input] tail.0 paused (mem buf overlimit)
    2024-03-20T13:29:12.915368764Z stderr F [2024/03/20 13:29:12] [ info] [input] tail.0 resume (mem buf overlimit)
    2024-03-20T13:29:12.923306843Z stderr F [2024/03/20 13:29:12] [ warn] [input] tail.0 paused (mem buf overlimit)
    2024-03-20T13:29:12.954591621Z stderr F [2024/03/20 13:29:12] [ info] [input] tail.0 resume (mem buf overlimit)
    2024-03-20T13:29:12.956495689Z stderr F [2024/03/20 13:29:12] [ warn] [input] tail.0 paused (mem buf overlimit)
    2024-03-20T13:29:13.527593998Z stderr F [2024/03/20 13:29:13] [ info] [input] tail.0 resume (mem buf overlimit)

FYI: Kubernetes rotates a container log when it reaches 10 MB, so when the system runs under high load the log rotation is very frequent.
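Taken together, the pause/resume warnings and the frequent rotation suggest that the tail input may stop reading while a file is rotated away. One direction that might be worth trying, as a minimal sketch and not a confirmed fix, is to buffer on disk, raise the memory limit, and keep rotated files open a little longer. The storage path and all numeric values below are assumptions chosen for illustration, and the `kube.*` / `*.log` wildcards are assumed as well. How `storage.type filesystem` interacts with `Mem_Buf_Limit` depends on the Fluent Bit version, and whether the Helm chart exposes these keys is not shown in this thread.

```ini
[SERVICE]
    # Assumed location; it must be a writable directory inside the pod
    # (for example a hostPath volume) so buffered chunks survive restarts.
    storage.path      /var/fluent-bit/state/flb-storage/

[INPUT]
    Name              tail
    # Wildcards in Tag and Path are assumed here.
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Docker_Mode       On
    # Buffer chunks on disk as well as in memory, so a burst is less
    # likely to leave the input paused.
    storage.type      filesystem
    # Illustrative value only; the original setting is 5MB.
    Mem_Buf_Limit     50MB
    # Seconds to keep monitoring a file after it has been rotated
    # (illustrative value), giving the input time to read its remaining lines.
    Rotate_Wait       30
    Skip_Long_Lines   On
    Refresh_Interval  10
```

If something like this is tried, the `paused (mem buf overlimit)` warnings are the obvious signal to watch: they should become rarer if buffering was the bottleneck.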

Could you check our config and let us know how we can avoid missing logs in CloudWatch? Please let us know if you need any more info from us.

Regards,
Shekhar

alanwu4321 commented 5 months ago

In the aws-for-fluent-bit ConfigMap, I had to add auto_create_group true to the bottom of the [OUTPUT] section and restart the pod; then it worked.

[OUTPUT]
    Name                  cloudwatch_logs
    Match                 *
    region                ap-northeast-1
    log_group_name        /aws/eks/ca-prod/aws-fluentbit-logs
    log_stream_prefix     fluentbit-
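
With that line appended, the section would presumably read as follows. This is a reconstruction from the description above, not a copy of the actual ConfigMap:

```ini
[OUTPUT]
    Name                  cloudwatch_logs
    Match                 *
    region                ap-northeast-1
    log_group_name        /aws/eks/ca-prod/aws-fluentbit-logs
    log_stream_prefix     fluentbit-
    # Added line: let the plugin create the log group if it does not exist yet.
    auto_create_group     true
```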

jdinsel-xealth commented 4 months ago

Have you inspected the fluent-bit containers for their use of CPU or considered increasing the resource settings in the chart?

aws/eks-charts issue #1080