fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Potential memory leak in v1.8.7 debug #4211

Closed lmuhlha closed 2 years ago

lmuhlha commented 3 years ago

Bug Report

Describe the bug

Memory usage of fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f constantly increases until the pod is OOM-killed.

To Reproduce

Expected behavior

Deploying fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f with a specified memory limit should work, without memory constantly increasing or the pod OOMing.

Screenshots


Your Environment

Additional context

ggallagher0 commented 3 years ago

We experienced the same issue when upgrading from 1.5.2 to 1.8.8. One pod would consistently use up to 3 GB of memory and then crash. Upping 'Flush' to 8 in the service config helped, but pods are still using 3x more memory than they did in 1.5.2.

[SERVICE]
    Flush                      8
    Log_Level                  info
    Daemon                     Off
    Parsers_File               parsers.conf
    HTTP_Server                On
    HTTP_Listen                0.0.0.0
    HTTP_Port                  2020
    storage.path               /tmp
    storage.sync               normal
    storage.backlog.mem_limit  100M
    storage.metrics            on

[INPUT]
    Name              tail
    Tag               kube.
    Path              /var/log/containers/xxx*.log
    Parser            docker
    DB                /tmp/flb_kube.xxx.db
    Mem_Buf_Limit     500MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    storage.type      filesystem

[INPUT]
    Name              tail
    Tag               kube.
    Path              /var/log/containers/kube-system*.log
    Parser            docker
    DB                /tmp/flb_kube.kube-system.db
    Mem_Buf_Limit     500MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    storage.type      filesystem

[INPUT]
    Name              tail
    Tag               kube.
    Path              /var/log/containers/cloudability*.log
    Parser            docker
    DB                /tmp/flb_kube.cloudability.db
    Mem_Buf_Limit     500MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    storage.type      filesystem

[INPUT]
    Name               systemd
    Tag                nodes
    DB                 /tmp/flb_systemd.db
    Mem_Buf_Limit      500MB
    Strip_Underscores  On
    Skip_Long_Lines    On
    Refresh_Interval   10
    storage.type       filesystem
    Systemd_Filter     _SYSTEMD_UNIT=kubelet.service

[INPUT]
    Name              tail
    Tag               k8s-audit
    Path              /opt/rke/var/log/kube-audit/k8s-audit-log.json
    Parser            k8s-audit
    DB                /tmp/flb_k8s_audit.db
    Mem_Buf_Limit     500MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    Rotate_Wait       10
    storage.type      filesystem

[FILTER]
    Name                 kubernetes
    Match                kube.*
    Kube_Tag_Prefix      kube.var.log.containers.
    Merge_Log            On
    Merge_Log_Key        log_processed
    K8S-Logging.Parser   On
    K8S-Logging.Exclude  On

[OUTPUT]
    Name         forward
    Match        *
    Host         fluentd-forward.xxx.svc.cluster.local.
    Port         24224
    Retry_Limit  5

NeckBeardPrince commented 3 years ago

Same here on 1.8.8

NeckBeardPrince commented 3 years ago

Any update on this? It's happening in 1.8.8 non-debug as well.

NeckBeardPrince commented 3 years ago

@lmuhlha Have you found a workaround for this?

edsiper commented 3 years ago

If you have 2.6 GB of data held in memory and then aim to convert it to JSON, you will exceed 3 GB for sure; your mem_buf_limits are too high.
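
For anyone tuning this, a fragment sketching the kind of adjustment being suggested (the 50MB figure and the storage path are illustrative, not taken from any config in this thread): keep each input's in-memory buffer small and let filesystem buffering absorb bursts instead of RAM.

[SERVICE]
    # illustrative values; pick a path on a volume with enough space
    storage.path               /var/log/flb-storage/
    storage.backlog.mem_limit  50M

[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Parser           docker
    # cap the hot in-memory buffer; overflow is paged to storage.path instead of RAM
    Mem_Buf_Limit    50MB
    storage.type     filesystem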

ggallagher0 commented 3 years ago

@edsiper our mem_buf_limits are 500 MB and the OP's are 1 MB. If this were just a configuration issue, it would be happening in both versions. When we rolled back to 1.5.2, memory use dropped right back to about 4 MB per pod, versus the 20 MB to 3 GB that the 1.8.8 pods used. In 1.8.8, one pod out of three would consistently run up to 3 GB within hours, while the others would slowly rise and hover around 20 MB.

edsiper commented 3 years ago

@ggallagher0 can you try reproducing the problem with the systemd input disabled? That would help isolate the plugin triggering the problem.
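
One way to narrow this down (a sketch, not a config anyone posted here): run a single input at a time against the null output, which discards every record, so any steady memory growth can be attributed to the input/filter side rather than the output.

[INPUT]
    Name            systemd
    Tag             nodes
    Systemd_Filter  _SYSTEMD_UNIT=kubelet.service

[OUTPUT]
    # null discards all records, taking the forward/es/s3 outputs out of the picture
    Name   null
    Match  *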

NeckBeardPrince commented 3 years ago

@ggallagher0 can you try reproducing the problem with the systemd input disabled? That would help isolate the plugin triggering the problem.

I have this same issue and I only use the tail input.

[FILTER]
    Name              aws
    Match             *
    imds_version      v1
    az                true
    ec2_instance_id   true
    ec2_instance_type true
    private_ip        true
    ami_id            true
    account_id        true
    hostname          true
    vpc_id            true
[FILTER]
    Name                kubernetes
    Match               ingress-nginx.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     ingress-nginx.
    Use_Kubelet         true
    Buffer_Size         0
    Merge_Log           On
    Keep_Log            False
[SERVICE]
    Flush             5
    Grace             120
    Log_Level         error
    Daemon            off
    Parsers_File      parsers.conf
    HTTP_Server       On
    HTTP_Listen       0.0.0.0
    HTTP_Port         2020
    storage.metrics   On
    storage.path      /var/log/flb-storage/

@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE filter-aws.conf
@INCLUDE output-elasticsearch.conf
@INCLUDE output-s3.conf
[INPUT]
    Name              tail
    Alias             ingress_nginx_appdat-system
    Tag               ingress_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path              /var/log/containers/ingress-nginx-controller*.log
    Parser            docker
    DB                /var/log/flb_ingress.db
    storage.type      filesystem
    Docker_Mode       On
    Skip_Long_Lines   On
    Refresh_Interval  5
    Buffer_Max_Size   1MB
    Mem_Buf_Limit     5MB
[OUTPUT]
    Name                      es
    Match                     *
    Host                      ${ELASTICSEARCH_HOST}
    Port                      ${ELASTICSEARCH_PORT}
    AWS_Auth                  ${ELASTICSEARCH_AWS_AUTH}
    AWS_Region                ${ELASTICSEARCH_AWS_REGION}
    TLS                       On
    Generate_ID               On
    Logstash_Prefix           access-logs
    Logstash_Format           On
    Replace_Dots              On
    Buffer_Size               False
    Retry_Limit               False
    storage.total_limit_size  2048M
[OUTPUT]
    Name                          s3
    Match                         *
    bucket                        ${S3_BUCKET_NAME}
    region                        ${S3_BUCKET_REGION}
    store_dir                     /var/log/flb-storage
    s3_key_format                 ${S3_BUCKET_KEY_FORMAT}
    s3_key_format_tag_delimiters  .-
    upload_timeout                5m
    Retry_Limit                   False
    storage.total_limit_size      2048M
[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On

NeckBeardPrince commented 3 years ago

Any update?

gabegorelick commented 3 years ago

https://github.com/fluent/fluent-bit/issues/4192 may be a related issue.

NeckBeardPrince commented 2 years ago

Same issue with 1.8.9 non-debug.

leonardo-albertovich commented 2 years ago

I wonder which case is the easiest one to reproduce locally. @lmuhlha's seems to be good output-wise because it uses the http plugin, but it's a bit convoluted configuration-wise; @ggallagher0's is good because it uses simpler inputs and the output plugin is forward, which means it can be set up locally without requiring any API keys.

Have you tried removing those outputs and adding a simple TCP endpoint to see if the leak is still there, @NeckBeardPrince?

I'm trying to come up with ideas about what these cases have in common and what simplifications could be made to test them. The one thing two out of three have in common is the Kubernetes filter plugin, and all of them use parsers.
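
A stripped-down local repro combining those common elements might look like the sketch below; the path, tag, and the 127.0.0.1:24224 forward target are placeholders rather than values from the reports above, and the forward output assumes some local listener (another Fluent Bit or Fluentd instance) is running.

[INPUT]
    Name             tail
    Tag              kube.*
    Path             /var/log/containers/*.log
    Parser           docker
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On

[FILTER]
    Name       kubernetes
    Match      kube.*
    Merge_Log  On

[OUTPUT]
    # any locally reachable sink works; here, forward to a listener on localhost
    Name   forward
    Match  *
    Host   127.0.0.1
    Port   24224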

lmuhlha commented 2 years ago

Just an update from my end: I've been trying to get the k8s filter to work with my setup, but on 1.5.7 I can't seem to get it to connect properly: [ warn] [filter:kubernetes:kubernetes.0] could not get meta for POD ... If I add the Kubelet options to the filter, Fluent Bit crashes; I assume the Kubelet features weren't supported in that version yet. If I use gcr.io/gke-on-prem-release/fluent-bit:v1.8.3-gke.3 (provided by Google in other discussions), I am able to use the Kubelet and connect properly, but we start to see several pods OOMing again.

Re: "I wonder which case is the easiest one to reproduce locally, @lmuhlha s seems to be good output wise because it's using the http plugin but it's a bit convoluted configuration wise," I can try to deploy a simplified config if that helps debug the issue.

lmuhlha commented 2 years ago

So I just tried this again with a simplified config and a decreased Mem_Buf_Limit, and I am still seeing the OOM on some pods. Fluent Bit version: gcr.io/gke-on-prem-release/fluent-bit:v1.8.3-gke.3, with Google's exporter gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0. Config:

 fluent-bit.conf: |-
    [SERVICE]
        Flush         5
        Grace         120
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_PORT     3020
    @INCLUDE containers.input.conf
    @INCLUDE filter.conf
    @INCLUDE output.conf
  containers.input.conf: |-
    [INPUT]
        Name             tail
        Alias            k8s_container
        Tag              k8s_container.<namespace_name>.<pod_name>.<container_name>
        Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
        Path             /var/log/containers/*.log
        Parser           docker
        DB               /var/run/google-fluentbit/pos-files/flb_kube.db
        Buffer_Max_Size  1MB
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 5
  filter.conf: |-
    [FILTER]
        Name                kubernetes
        Match               k8s_container.<namespace_name>.<pod_name>.<container_name>
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        Merge_Log           On
        Buffer_Size         0
        Use_Kubelet         true
        Kubelet_Port        10250
  output.conf: |-
    # Single output for all logs, project log routing handled by sinks in host project
    [OUTPUT]
        Name                       http
        Alias                      http-export-all
        Match                      *
        Host                       127.0.0.1
        Port                       3021
        URI                        /logs
        header_tag                 FLUENT-TAG
        Format                     msgpack
        Retry_Limit                2
  parsers.conf: |-
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep    On

Pod 1: (screenshot, 2021-11-18 6:12 PM) Pod 2: (screenshot, 2021-11-18 6:14 PM)
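
One further simplification that could be layered on top of this config (a sketch, not something reported in this thread): swap the http output for the null output so the exporter side is ruled out entirely; if memory still climbs, the growth is on the input/filter side.

[OUTPUT]
    # discards all records; stands in for the http output during the test run
    Name   null
    Match  *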

lmuhlha commented 2 years ago

Happening on 1.7.9 as well. (screenshot, 2021-11-22 7:49 PM)

NeckBeardPrince commented 2 years ago

Happening on 1.7.9 as well. (screenshot, 2021-11-22 7:49 PM)

Do you mean 1.8.9? After going back to 1.7.9 I'm no longer having the issue. But 1.8.9 is also having the same problem.

lmuhlha commented 2 years ago

Nope, I actually have the issue in 1.7.9 as well. So far, anything I try above 1.5.7 does it; I'll continue trying things out.

lmuhlha commented 2 years ago

Just tried 1.7.7 with no issue.

triThirty commented 2 years ago

Same issue when I use 1.8.10. As you can see in the graph I posted, what confuses me is that container_memory_working_set_bytes{endpoint="https-metrics", id="/kubepods/pod2cfb2523-0d79-43f5-a2a0-db07e0029bdd", instance="10.34.7.89:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="logging", node="ip-10-34-7-89.ec2.internal", pod="fluent-bit-wr5wt", service="kube-prometheus-operator-k-kubelet"} has a peak.

(screenshot, 2021-12-17 2:50 PM)

namevic commented 2 years ago

After updating to 1.8.12 I don't see the memory leak.

KrishnaKant1509 commented 2 years ago

Maybe these two are the same issue: https://github.com/fluent/fluent-bit/issues/5147

KrishnaKant1509 commented 2 years ago

After updating to 1.8.12 I don't see the memory leak.

Even with 1.8.12, I am still seeing the problem when K8S-Logging.Exclude is turned On in the kubernetes filter plugin. Memory remains constant when I turn this option Off.
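
For anyone trying to reproduce that comparison, the toggle being described is the single Exclude line on the kubernetes filter; the fragment below is a sketch using the filter settings seen earlier in this thread, with everything else held constant between the two runs.

[FILTER]
    Name                 kubernetes
    Match                kube.*
    Merge_Log            On
    K8S-Logging.Parser   On
    # memory reportedly grows with this On and stays flat with it Off
    K8S-Logging.Exclude  On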

danielserrao commented 2 years ago

Same issue here. Tested with versions 1.8.11 and 1.8.12 and with K8S-Logging.Exclude Off, but the memory always keeps leaking.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.