aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0
451 stars 135 forks source link

FluentBit stops sending logs to S3 #732

Open akshay-wani opened 1 year ago

akshay-wani commented 1 year ago

Hello All,

FluentBit stops sending logs to S3 after 6-12 hrs of starting the instance. It is giving the Running upload timer callback (cb_s3_upload).. message continuously in its logs.

Exact logs for the duration when fluentbit stops sending logs to S3 :

[2023/09/23 08:52:04] [debug] [upstream] KA connection #60 to s3.us-west-2.amazonaws.com:443 has been disconnected by the remote service
[2023/09/23 08:52:04] [debug] [socket] could not validate socket status for #60 (don't worry)
[2023/09/23 08:52:16] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:52:16] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:52:16] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:52:36] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:52:36] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:52:36] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:52:56] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:52:56] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:52:56] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:16] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:16] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:16] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:16] [debug] [http_client] not using http_proxy for header
[2023/09/23 08:53:16] [debug] [aws_credentials] Requesting credentials from the EKS provider..
[2023/09/23 08:53:16] [debug] [upstream] KA connection #36 to s3.us-west-2.amazonaws.com:443 is now available
[2023/09/23 08:53:16] [debug] [output:s3:s3.1] PutObject http status=200
[2023/09/23 08:53:16] [ info] [output:s3:s3.1] Successfully uploaded object /nginx-logs/nginx/log/containers/2023/09/23/mynginxapp-nginx-9679b4bd42-shxya__nginx/nx9dswLM6-51-16.gz
[2023/09/23 08:53:24] [debug] [upstream] KA connection #36 to s3.us-west-2.amazonaws.com:443 has been disconnected by the remote service
[2023/09/23 08:53:24] [debug] [socket] could not validate socket status for #36 (don't worry)
[2023/09/23 08:53:36] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:36] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:36] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:56] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:56] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:53:56] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:16] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:16] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:16] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:36] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:36] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:36] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:56] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:56] [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:54:56] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:55:16] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
[2023/09/23 08:55:16] [debug] [output:s3:s3.2] Running upload timer callback (cb_s3_upload)..

I'm using fluentbit (fluent-bit-0.37.0) official helm charts with public.ecr.aws/aws-observability/aws-for-fluent-bit:stable image.

kind: DaemonSet

replicaCount: 1

image:
  repository: public.ecr.aws/aws-observability/aws-for-fluent-bit
  tag: "stable"
  pullPolicy: Always

testFramework:
  enabled: false
  image:
    repository: busybox
    pullPolicy: Always
    tag: latest

serviceAccount:
  create: false
  annotations: {}
  name: "fluent-bit"

nodeSelector:
  role: "app"

rbac:
  create: true
  nodeAccess: false

flush: 5

metricsPort: 2020

logLevel: debug

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush {{ .Values.flush }}
        Log_Level {{ .Values.logLevel }}
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.metricsPort }}
        Health_Check On
        storage.sync              normal
        storage.checksum          off
        @INCLUDE application-log.conf

  inputs: |
    [INPUT]
        Name                tail
        Tag                 nginx.*
        Path                /var/log/containers/*nginx*-nginx*
        Exclude_Path        /var/log/containers/*opa*, /var/log/containers/*nginx*init*, /var/log/containers/*nginx*proxy*
        Parser              nginxparser
        Mem_Buf_Limit       50MB
        Buffer_Max_Size     128k
        Skip_Long_Lines     On
        Refresh_Interval    10

    [INPUT]
        Name                tail
        Tag                 nginx-istioproxy.*
        Path                /var/log/containers/*nginx*proxy*
        Exclude_Path        /var/log/containers/*nginx*-nginx*, /var/log/containers/*opa*, /var/log/containers/*nginx*init*
        Parser              istioproxyparser
        Mem_Buf_Limit       50MB
        Buffer_Max_Size     128k
        Skip_Long_Lines     On
        Refresh_Interval    10

    [INPUT]
        Name                tail
        Tag                 nginx-other.*
        Path                /var/log/containers/*nginx*-opa*, /var/log/containers/*nginx*init*
        Exclude_Path        /var/log/containers/*nginx*-nginx*, /var/log/containers/*nginx*proxy*
        Parser              opaparser
        Mem_Buf_Limit       50MB
        Buffer_Max_Size     128k
        Skip_Long_Lines     On
        Refresh_Interval    10

  filters: |
    [FILTER]
        Name                  kubernetes
        Match                 nginx.*
        Kube_URL              [https://kubernetes.default.svc:443](https://kubernetes.default.svc/)
        Merge_Log             On
        Merge_Log_Trim        On
        K8S-Logging.Parser    On
        K8S-Logging.Exclude   On
        Annotations           Off
        Labels                On

    [FILTER]
        Name                  kubernetes
        Match                 nginx-istioproxy.*
        Kube_URL              [https://kubernetes.default.svc:443](https://kubernetes.default.svc/)
        Merge_Log             On
        Merge_Log_Trim        On
        K8S-Logging.Parser    On
        K8S-Logging.Exclude   On
        Annotations           Off
        Labels                On

    [FILTER]
        Name                  kubernetes
        Match                 nginx-other.*
        Kube_URL              [https://kubernetes.default.svc:443](https://kubernetes.default.svc/)
        Merge_Log             On
        Merge_Log_Trim        On
        K8S-Logging.Parser    On
        K8S-Logging.Exclude   On
        Annotations           Off
        Labels                On

    [FILTER]
        Name    modify
        Match   *
        Remove  kubernetes.container_hash
        Remove  kubernetes.docker_id
        Remove  kubernetes.pod_id
        Remove  timestamp
        Remove  stream
        Remove  logtag

    [FILTER]
        Name    modify
        Match   nginx-istioproxy.*
        Add     src nginx-istio-proxy

    [FILTER]
        Name    modify
        Match   nginx-other.*
        Add     src nginx-opa

    [FILTER]
        Name    modify
        Match   nginx.*
        Add     src nginx

  outputs: |
    [OUTPUT]
        Name                          s3
        Match                         nginx.*
        bucket                        nginx-logs
        region                        us-west-2
        use_put_object                On
        compression                   gzip
        retry_limit                   5
        total_file_size               5M
        upload_timeout                2m
        s3_key_format                 /nginx-logs/nginx/$TAG[2]/$TAG[3]/%Y/%m/%d/$TAG[4]__$TAG[5]/$UUID-%M-%S.gz
        s3_key_format_tag_delimiters  ._

    [OUTPUT]
        Name                          s3
        Match                         nginx-other.*
        bucket                        nginx-logs
        region                        us-west-2
        use_put_object                On
        retry_limit                   5
        compression                   gzip
        total_file_size               5M
        upload_timeout                2m
        s3_key_format                 /nginx-logs/nginx-other/$TAG[2]/$TAG[3]/%Y/%m/%d/$TAG[4]__$TAG[5]/$UUID-%M-%S.gz
        s3_key_format_tag_delimiters  ._

    [OUTPUT]
        Name                          s3
        Match                         nginx-istioproxy.*
        bucket                        nginx-logs
        region                        us-west-2
        use_put_object                On
        retry_limit                   5
        compression                   gzip
        total_file_size               5M
        upload_timeout                2m
        s3_key_format                 /nginx-logs/nginx-istioproxy/$TAG[2]/$TAG[3]/%Y/%m/%d/$TAG[4]__$TAG[5]/$UUID-%M-%S.gz
        s3_key_format_tag_delimiters  ._

  customParsers: |
    [PARSER]
        Name    nginxparser
        Format  regex
        Regex   ^(?<timestamp>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")

    [PARSER]
        Name    istioproxyparser
        Format  regex
        Regex   ^(?<timestamp>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) \[(?<start_time>[^\]]*)\] "(?<method>\S+) (?<host>[^ ]*) (?<path>[^ ]*) (?<protocol>\S+)" (?<response_code>[^ ]*)  (?<response_flags>[^ ]*) (?<response_code_details>[^ ]*) (?<connection_termination_details>[^ ]*) "(?<upstream_transport_failure_reason>[^ ]*)" (?<bytes_received>[^ ]*) (?<bytes_sent>[^ ]*) (?<duration>[^ ]*) (?<x_envoy_upstream_service_time>[^ ]*) "(?<x_forwarded_for>[^ ]*)" "(?<user_agent>[^\"]*)" "(?<x_request_id>[^\"]*)" "(?<authority>[^ ]*)" "(?<upstream_host>[^ ]*)" (?<upstream_cluster>[^ ]*) (?<upstream_local_address>[^ ]*) (?<downstream_local_address>[^ ]*) (?<downstream_remote_address>[^ ]*) (?<requested_server_name>[^ ]*) (?<route_name>[^  ]*)

    [PARSER]
        Name    opaparser
        Format  regex
        Regex   ^(?<timestamp>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
        Time_Keep   Off

  upstream: {}

volumeMounts:
  - name: config
    mountPath: /fluent-bit/etc/fluent-bit.conf
    subPath: fluent-bit.conf

daemonSetVolumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
  - name: etcmachineid
    hostPath:
      path: /etc/machine-id
      type: File

daemonSetVolumeMounts:
  - name: varlog
    mountPath: /var/log
  - name: varlibdockercontainers
    mountPath: /var/lib/docker/containers
    readOnly: true
  - name: etcmachineid
    mountPath: /etc/machine-id
    readOnly: true

Fluentbit version for public.ecr.aws/aws-observability/aws-for-fluent-bit:stable is v1.9.10

I tried increasing Buffer_Max_Size and setting auto_retry_requests parameter to true, but it still is not helping.

Please suggest

PettitWesley commented 1 year ago

I don't know why this happened. I would check if Fluent Bit is still reading and ingesting new data or not.

I'd enable debug logging: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#enable-debug-logging

We have a log loss runbook with steps and queries you can run to check if Fluent Bit is ingesting data, and it requires you enable debug logging first: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#log-loss-investigation-runbook

akshay-wani commented 1 year ago

Debug logging is enabled. The log I have pasted is of Debug Log level. I do not see any warn, err or error in FluentBit logs.

I would check if Fluent Bit is still reading and ingesting new data or not. --> I'm not sure about it as FluentBit logs are only showing [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload).. message in logs after the issue occurs.

akshay-wani commented 1 year ago

One thing I noticed for public.ecr.aws/aws-observability/aws-for-fluent-bit image, I tried checking FluentBit version using ./fluent-bit --version, it shows v1.9.10 only for all images. I checked for images with stable and latest tags. Although the Git commit hash is different.

I also tried checking the git commits hash for versions in https://github.com/aws/aws-for-fluent-bit/tree/mainline repo but not found it.

image

image

PettitWesley commented 1 year ago

@akshay-wani Can you check the AWS version file as explained here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#which-version-did-i-deploy

Debug logging is enabled. The log I have pasted is of Debug Log level. I do not see any warn, err or error in FluentBit logs.

I would check if Fluent Bit is still reading and ingesting new data or not. --> I'm not sure about it as FluentBit logs are only showing [debug] [output:s3:s3.0] Running upload timer callback (cb_s3_upload).. message in logs after the issue occurs.

If there is your full debug log output, then per the log loss runbook, since we do not see any of the ingestion messages, Fluent Bit is not ingesting any new data.

I'm not sure why that would happen if your pods are emitting new logs to log files.

brianluong commented 9 months ago

We are experiencing a similar issue where our fluentbit container occasionally becomes unresponsive. During these incidents, the container not only fails to forward logs to S3, but also ceases to generate its own logs. This behavior suggests that the process either crashes or gets halted.

There are no significant fluctuations in resource usage, and the container does not run out of memory.

~Unfortunately, there are no noteworthy logs preceding these occurrences, as we only have 'info' level and higher logs enabled.~

EDIT: we got debug logs, and they are similar to what OP posted:

2024-01-03 23:12:49 UTC | TRACE | INFO | (pkg/trace/info/stats.go:91 in LogAndResetStats) | No data received
[2024/01/03 23:12:55] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..

Could anyone provide any guidance or suggestions on this matter?

jvs87 commented 9 months ago

Hello,

Im facing with the same problem. Any idea?

Thanks

akshay-wani commented 9 months ago

Hello,

I was using the aws image for fluentbit, try using official fluentbit image, that solved my issue. Image - cr.fluentbit.io/fluent/fluent-bit