fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

fluentbit missing logs in aws cloudwatch #8631

Open mukshe01 opened 3 months ago

mukshe01 commented 3 months ago

Bug Report

Describe the bug
We are running Fluent Bit to push application logs from our Kubernetes cluster (an EKS cluster with EC2 machines as worker nodes) to CloudWatch. Recently we observed that some log entries are missing in CloudWatch when the system is under high load.

To Reproduce

We also see many occurrences of the following (our memory buffer is configured via Mem_Buf_Limit) when the system is under high load:

2024-03-20T13:29:12.624465969Z stderr F [2024/03/20 13:29:12] [ warn] [input] tail.0 paused (mem buf overlimit)
2024-03-20T13:29:12.915368764Z stderr F [2024/03/20 13:29:12] [ info] [input] tail.0 resume (mem buf overlimit)
2024-03-20T13:29:12.923306843Z stderr F [2024/03/20 13:29:12] [ warn] [input] tail.0 paused (mem buf overlimit)
2024-03-20T13:29:12.954591621Z stderr F [2024/03/20 13:29:12] [ info] [input] tail.0 resume (mem buf overlimit)
2024-03-20T13:29:12.956495689Z stderr F [2024/03/20 13:29:12] [ warn] [input] tail.0 paused (mem buf overlimit)
2024-03-20T13:29:13.527593998Z stderr F [2024/03/20 13:29:13] [ info] [input] tail.0 resume (mem buf overlimit)
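The pause/resume churn above is the expected behaviour when the tail input hits Mem_Buf_Limit, but while the input is paused new log lines can be lost to container log rotation. As a back-of-the-envelope sketch (the line size and ingest rate below are illustrative assumptions, not measurements from this cluster), a 5 MB buffer fills within seconds once the output cannot keep up:

```shell
# Hypothetical figures: 500-byte average log line, 2,000 lines/sec node-wide.
MEM_BUF_BYTES=$((5 * 1024 * 1024))   # Mem_Buf_Limit 5MB
LINE_BYTES=500                        # assumed average line length
LINES_PER_SEC=2000                    # assumed ingest rate
INGEST_BYTES_PER_SEC=$((LINE_BYTES * LINES_PER_SEC))
# Seconds until tail.0 is paused if the output drains nothing in the meantime:
SECONDS_TO_FILL=$((MEM_BUF_BYTES / INGEST_BYTES_PER_SEC))
echo "buffer fills in ~${SECONDS_TO_FILL}s"
```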


- Steps to reproduce the problem:

**Expected behavior**
Logs should not be missing in CloudWatch.

**Screenshots**

**Your Environment**
* Version used: fluentbit 2.31.11
* Configuration:
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020
    Health_Check On
    HC_Errors_Count 5
    HC_Retry_Failure_Count 5
    HC_Period 5

    Parsers_File /fluent-bit/parsers/parsers.conf
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Docker_Mode       On
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc.cluster.local:443
    Merge_Log           On
    Merge_Log_Key       data
    Keep_Log            On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Buffer_Size         2048k
[OUTPUT]
    Name                  cloudwatch_logs
    Match                 *
    region                us-east-1
    log_group_name        /aws/containerinsights/one-source-qa-n5p1P1d1/application-new
    log_stream_prefix     fluentbit-
    log_stream_template   $kubernetes['namespace_name'].$kubernetes['container_name']
    auto_create_group     true

* Environment name and version (e.g. Kubernetes? What version?):
Kubernetes, version 1.27.
Helm chart used for Fluent Bit:
https://github.com/aws/eks-charts/tree/master/stable/aws-for-fluent-bit
version: 0.1.28

* Server type and version:
AWS EC2.
* Operating System and version:
Amazon Linux 2.
AWS-optimized AMI, AMI ID: ami-013895b64fa9cbcba
* Filters and plugins:
See the configuration section above.
**Additional context**
In our application logs we write useful information about orders, which helps us with telemetry. Fluent Bit is configured to watch /var/log/containers/*.log and push the entries to CloudWatch.
It looks like, when the system is under high load, some log entries/lines are missing in CloudWatch.
patrick-stephens commented 3 months ago

Please follow the issue template to supply all required information including things like version and target env.

mukshe01 commented 3 months ago

Hi Patrick, apologies. I have now followed the issue template and updated the issue; please let me know if you require any further info.

axot commented 3 months ago

There is a similar issue on EKS Fargate: https://github.com/aws/aws-for-fluent-bit/issues/796. Could you please check whether it is related to the NOFILE limit with the following command?

$ kubectl exec -ti REPLACE_WITH_YOUR_FLUENTBIT_POD -- sh -c 'grep files /proc/*/limits; grep -a '\r' /proc/*/cmdline'
mukshe01 commented 3 months ago

Hi, thank you for your response. Below is the output of the command.

/proc/1/limits:Max open files            1048576              1048576              files
/proc/32/limits:Max open files            1048576              1048576              files
/proc/self/limits:Max open files            1048576              1048576              files
/proc/thread-self/limits:Max open files            1048576              1048576              files
/proc/1/cmdline:/fluent-bit/bin/fluent-bit-e/fluent-bit/firehose.so-e/fluent-bit/cloudwatch.so-e/fluent-bit/kinesis.so-c/fluent-bit/etc/fluent-bit.conf
/proc/32/cmdline:sh-cgrep files /proc/*/limits; grep -a r /proc/*/cmdline
/proc/self/cmdline:grep-ar/proc/1/cmdline/proc/32/cmdline/proc/self/cmdline/proc/thread-self/cmdline
/proc/thread-self/cmdline:grep-ar/proc/1/cmdline/proc/32/cmdline/proc/self/cmdline/proc/thread-self/cmdline

Output of ulimit from inside the Fluent Bit pod:
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
sh-4.2# ulimit -n
1048576

Should we adjust anything?

Also, we increased Mem_Buf_Limit from 5 MB to 50 MB and saw a significant reduction in missing logs in CloudWatch. Would you be able to suggest any improvements to the Fluent Bit config so the missing-logs issue won't occur in the future?

Regards Shekhar
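A mitigation often suggested for this pause/resume pattern (a sketch only, not validated against this cluster; the option names come from Fluent Bit's buffering documentation) is to enable filesystem buffering, so overflow spills to disk instead of pausing the tail input:

```
[SERVICE]
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M

[INPUT]
    Name              tail
    storage.type      filesystem
    # ... existing tail options unchanged ...
```

With `storage.type filesystem`, data beyond the memory limit is buffered on disk rather than dropped by a paused input, at the cost of some node disk I/O.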

axot commented 3 months ago

Thanks for sharing the output. Initially, I suspected it might be related to the NOFILE limits, but your results suggest they are sufficient. I'll try to reproduce the issue in my environment. Thanks.

mukshe01 commented 1 month ago

Hi @axot ,

Did you have any luck reproducing this issue? Please let me know if I need to provide any more info.

Regards Shekhar

amber-lamp-dev commented 2 weeks ago

I'm getting a similar error with Fluent Bit in my environment too.

[2024/07/02 10:54:24] [error] [plugins/in_tail/tail_file.c:1432 errno=2] No such file or directory
[2024/07/02 10:54:24] [error] [plugins/in_tail/tail_fs_inotify.c:147 errno=2] No such file or directory
[2024/07/02 10:54:24] [error] [input:tail:tail.0] inode=97518421 cannot register file /var/log/containers/my-nginx-586cfd5d59-9bgqm_default_my-nginx-d7f6466fe5757cc8ff6183b7a764b35dc509ec923f7bd0eee9d52b1b7680a952.log

Information on the environment where the error occurs, and the steps to reproduce, are as follows.

  1. Creating an EKS cluster

Use the following configuration file to create a cluster with eksctl.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: sample-cluster
  region: ap-northeast-1
  version: "1.28"

vpc:
  id: "< vpc id >"
  cidr: "< cidr >"
  subnets:
    private:
      ap-northeast-1a:
        id: "< subnet id >"
        cidr: "< cidr >"

      ap-northeast-1c:
        id: "< subnet id >"
        cidr: "< cidr >"

managedNodeGroups:
  - name: ng-1
    instanceType: m5.xlarge
    desiredCapacity: 2
    privateNetworking: true
  2. Installing Fluent Bit

Follow the AWS documentation below to install Fluent Bit on the EKS cluster created in step 1 ("Set up Fluent Bit as a DaemonSet to send logs to CloudWatch Logs - Amazon CloudWatch"): https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs-FluentBit.html#Container-Insights-FluentBit-setup

The manifest file I used is shown below.

kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml

An output plugin has been added and the log level has been changed.

$ diff fluent-bit.yaml backup_fluent-bit.yaml
50c50
<         Log_Level                 info
---
>         Log_Level                 error
116,123d115
<
<     [OUTPUT]
<         Name firehose
<         Match   application.*
<         region ap-northeast-1
<         delivery_stream < The name of the Kinesis Firehose Delivery stream >
<         retry_limit      5
  3. Installing the AWS Load Balancer Controller

Follow the steps below to install the AWS Load Balancer Controller: https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html

  4. Creating a Docker image

Create a Docker image based on the following Dockerfile.

FROM amazonlinux:2023

RUN yum update -y && \
    yum install nginx -y && \
    yum clean all

RUN ln -sf /dev/stdout /var/log/nginx/access.log \
  && ln -sf /dev/stderr /var/log/nginx/error.log

CMD ["nginx", "-g", "daemon off;"]
  5. Create resources

Create the Service and Deployment resources based on the Docker image created in step 4. The manifest file I used is shown below; nginx is configured to proxy to an index.html placed on S3 via a VPC endpoint.
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: internal
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
  type: LoadBalancer
  selector:
    run: my-nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
        - name: my-nginx
          image: < Image URI >
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 1500m
              memory: 1G
            limits:
              cpu: 1500m
              memory: 1G
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf
              name: nginx-config-vol
              subPath: nginx.conf
      volumes:
        - name: nginx-config-vol
          configMap:
              name: nginx-config
              items:
                - key: nginx.conf
                  path: nginx.conf
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-config
data:
  nginx.conf: |
    worker_rlimit_nofile 30000;
    events {
      worker_connections 10000;
    }
    http {
      server_tokens off;
      client_header_timeout 13s;
      keepalive_timeout 350s;

      upstream s3-vpce {
        server < s3 interface VPC endpoints ipaddress >:80;
        server < s3 interface VPC endpoints ipaddress >:80;
        server < s3 interface VPC endpoints ipaddress >:80;
      }
      map $http_host $s3_backet {
        default "< s3 bucket name >";
      }

      log_format upstreamlog '[$time_local] $http_x_forwarded_for $remote_addr $status $host $upstream_addr $upstream_cache_status $upstream_status $request $http_referer $body_bytes_sent $request_time $http_user_agent';
      access_log /var/log/nginx/access.log upstreamlog;
      error_log /var/log/nginx/error.log notice;
      rewrite_log off;

      server {
        listen 8080;

        location / {
          rewrite  ^/$ /index.html break;
          proxy_set_header  Host $s3_backet;
          proxy_pass        http://s3-vpce;
          proxy_connect_timeout 10s;
          proxy_read_timeout 30s;
        }
      }
    }
  6. Load the Network Load Balancer

Use ApacheBench (version 2.3) to load the Network Load Balancer created in step 5.
    $ ab -n 10000000 -c 100 -p post.txt -q http://< Network Load Balancer DNS Name >/index.html
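For scale, the load test above generates a very large volume of access-log data for Fluent Bit to tail. Assuming a ~300-byte average upstreamlog line (an illustrative figure, not measured), the full run produces roughly:

```shell
REQUESTS=10000000      # -n 10000000 from the ab command above
LINE_BYTES=300         # assumed average access-log line length
TOTAL_MB=$((REQUESTS * LINE_BYTES / 1024 / 1024))
echo "~${TOTAL_MB} MB of access logs across the run"
```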