Potential memory leak in v1.8.7 debug

lmuhlha commented 3 years ago

Bug Report

Describe the bug fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f

To Reproduce

Rubular link if applicable:
Example log message if applicable:
```
stream of "OOMKilling" warnings
```
Steps to reproduce the problem: Deploy fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f and wait 10-15 mins. Container will OOM.

Expected behavior Deploying fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f with a specified amount of memory will work and not constantly increase / OOM.

Screenshots

Your Environment

Version used: fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f

Configuration:

fluent-bit.conf: |-
[SERVICE]
    Flush         5
    Grace         120
    Log_Level     debug
    Daemon        off
    Parsers_File  parsers.conf
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_PORT     3020

@INCLUDE containers.input.conf
@INCLUDE system.input.conf
@INCLUDE filter.conf
@INCLUDE output.conf

containers.input.conf: |-
[INPUT]
    Name             tail
    Alias            k8s_container
    Tag              k8s_container.<namespace_name>.<pod_name>.<container_name>
    Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path             /var/log/containers/*.log
    DB               /var/run/google-fluentbit/pos-files/flb_kube.db
    Buffer_Max_Size  1MB
    Mem_Buf_Limit    50MB
    Skip_Long_Lines  On
    Refresh_Interval 5
    Read_from_Head   True

system.input.conf: |-
# Example:
# Dec 21 23:17:22 gke-foo-1-1-4b5cbd14-node-4eoj startupscript: Finished running startup script /var/run/google.startup.script
[INPUT]
    Name   tail
    Alias  syslog
    Parser syslog
    Path   /var/log/startupscript.log
    DB     /var/log/startupscript.db
    Alias  startupscript
    Tag    startupscript

[INPUT]
    Name    tail
    Alias   docker
    Path    /var/log/docker.log
    Tag     docker
    Parser  docker
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

[INPUT]
    Name  tail
    Alias etcd
    Path  /var/log/etcd.log
    Tag   etcd
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

[INPUT]
    Name             tail
    Alias            kubelet
    Path             /var/log/kubelet.log
    Tag              kubelet
    Multiline        off
    Parser_Firstline firstline
    Parser_1         format1
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

# Example:
# I1118 21:26:53.975789       6 proxier.go:1096] Port "nodePort for kube-system/default-http-backend:http" (:31429/tcp) was open before and is still needed
[INPUT]
    Name            tail
    Alias           kube-proxy
    Tag             kube-proxy
    Path            /var/log/kube-proxy.log
    DB              /var/log/kube-proxy.db
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB
    Refresh_Interval 1
    Parser          glog

[INPUT]
    Name             tail
    Alias            kube-apiserver
    Path             /var/log/kube-apiserver.log
    Tag              kube-apiserver
    Multiline        off
    Parser_Firstline firstline
    Parser_1         format1
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

[INPUT]
    Name             tail
    Alias            kube-controller-manager
    Path             /var/log/kube-controller-manager.log
    Tag              kube-controller-manager
    Multiline        off
    Parser_Firstline firstline
    Parser_1         format1
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

[INPUT]
    Name             tail
    Alias            kube-scheduler
    Path             /var/log/kube-scheduler.log
    Tag              kube-scheduler
    Multiline        off
    Parser_Firstline firstline
    Parser_1         format1
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

[INPUT]
    Name             tail
    Alias            rescheduler
    Path             /var/log/rescheduler.log
    Tag              rescheduler
    Multiline        off
    Parser_Firstline firstline
    Parser_1         format1
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

[INPUT]
    Name             tail
    Alias            glbc
    Path             /var/log/glbc.log
    Tag              glbc
    Multiline        off
    Parser_Firstline firstline
    Parser_1         format1
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

[INPUT]
    Name             tail
    Alias            cluster-autoscaler
    Path             /var/log/cluster-autoscaler.log
    Tag              cluster-autoscaler
    Multiline        off
    Parser_Firstline firstline
    Parser_1         format1
    Mem_Buf_Limit    1MB
    Skip_Long_Lines  On
    Refresh_Interval 1

# Logs from systemd-journal for interesting services.
[INPUT]
    Name           systemd
    Alias          sysd-docker
    Tag            docker
    Systemd_Filter _SYSTEMD_UNIT=docker.service
    Path           /var/log/journal
    DB             /var/log/gcp-journald-docker.db
    Read_from_head  true
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB
    Refresh_Interval 1

[INPUT]
    Name           systemd
    Alias          sysd-container-runtime
    Tag            container-runtime
    Systemd_Filter _SYSTEMD_UNIT=containerd.service
    Path           /var/log/journal
    DB             /var/log/gcp-journald-container-runtime.db
    Read_from_head true
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB
    Refresh_Interval 1

[INPUT]
    Name            systemd
    Alias           sysd-kubelet
    Tag             kubelet
    Systemd_Filter  _SYSTEMD_UNIT=kubelet.service
    Path            /var/log/journal
    DB              /var/log/gcp-journald-kubelet.db
    Read_from_head  true
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB
    Refresh_Interval 1

[INPUT]
    Name           systemd
    Alias          sysd-node-problem-detector
    Tag            node-problem-detector
    Systemd_Filter _SYSTEMD_UNIT=node-problem-detector.service
    Path           /var/log/journal
    DB             /var/log/gcp-journald-node-problem-detector.db
    Read_from_head  true
    Buffer_Max_Size 1MB
    Mem_Buf_Limit   1MB
    Refresh_Interval 1

filter.conf: |-

[FILTER]
    Name         parser
    Match        k8s_container.*
    Key_Name     log
    Reserve_Data True
    Parser       docker
    Parser       containerd

[FILTER]
    Name        modify
    Match       *
    Hard_rename log message

[FILTER]
    Name         parser
    Match        k8s_container.*
    Key_Name     message
    Reserve_Data True
    Parser       glog
    Parser       json

# level is a common synonym for severity,
# the default field name in libraries such as GoLang's zap.
# populate severity with level, if severity does not exist.
[FILTER]
    Name        modify
    Match       k8s_container.*
    Copy        level severity

output.conf: |-

# handle namespaces in droplist first
{% for namespace in log_droplist %}
[OUTPUT]
    Name  null
    Alias null-{{namespace}}
    Match k8s_container.{{namespace}}.*
{% endfor %}

# Single output for all logs, project log routing handled by sinks in host project
[OUTPUT]
    Name                       http
    Alias                      http-export-all
    Match                      *
    Host                       127.0.0.1
    Port                       3021
    URI                        /logs
    header_tag                 FLUENT-TAG
    Format                     msgpack
    Retry_Limit                2

parsers.conf: |-
[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z

[PARSER]
    Name        containerd
    Format      regex
    Regex       ^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z

[PARSER]
    Name        json
    Format      json

[PARSER]
    Name        glog
    Format      regex
    Regex       ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
    Time_Key    time
    Time_Format %m%d %H:%M:%S.%L

[PARSER]
    Name        syslog
    Format      regex
    Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
    Time_Key    time
    Time_Format %b %d %H:%M:%S

[PARSER]
    Name firstline
    Format regex
    Regex  /^\w\d{4}/

Environment name and version (e.g. Kubernetes? What version?): Kubernetes
Server type and version:
Operating System and version: "Debian GNU/Linux 10 (buster)"
Filters and plugins: See config above

Additional context

ggallagher0 commented 3 years ago

We experienced the same issue when upgrading from 1.5.2 to 1.8.8. One pod would consistently use up to 3GB of memory and then crash. Upping 'Flush' to 8 in the service config helped, but pods are still using 3x more memory than they did in 1.5.2. ``[SERVICE] Flush 8 Log_Level info Daemon Off Parsers_File parsers.conf HTTP_Server On HTTP_Listen 0.0.0.0 HTTP_Port 2020 storage.path /tmp storage.sync normal storage.backlog.mem_limit 100M storage.metrics on

[INPUT] Name tail Tag kube. Path /var/log/containers/xxx*.log Parser docker DB /tmp/flb_kube.xxx.db Mem_Buf_Limit 500MB Skip_Long_Lines On Refresh_Interval 10 storage.type filesystem

[INPUT] Name tail Tag kube. Path /var/log/containers/kube-system*.log Parser docker DB /tmp/flb_kube.kube-system.db Mem_Buf_Limit 500MB Skip_Long_Lines On Refresh_Interval 10 storage.type filesystem

[INPUT] Name tail Tag kube. Path /var/log/containers/cloudability*.log Parser docker DB /tmp/flb_kube.cloudability.db Mem_Buf_Limit 500MB Skip_Long_Lines On Refresh_Interval 10 storage.type filesystem

[INPUT] Name systemd Tag nodes DB /tmp/flb_systemd.db Mem_Buf_Limit 500MB Strip_Underscores On Skip_Long_Lines On Refresh_Interval 10 storage.type filesystem Systemd_Filter _SYSTEMD_UNIT=kubelet.service

[INPUT] Name tail Tag k8s-audit Path /opt/rke/var/log/kube-audit/k8s-audit-log.json Parser k8s-audit DB /tmp/flb_k8s_audit.db Mem_Buf_Limit 500MB Skip_Long_Lines On Refresh_Interval 10 Rotate_Wait 10 storage.type filesystem

[FILTER] Name kubernetes Match kube.* Kube_Tag_Prefix kube.var.log.containers. Merge_Log On Merge_Log_Key log_processed K8S-Logging.Parser On K8S-Logging.Exclude On

[OUTPUT] Name forward Match * Host fluentd-forward.xxx.svc.cluster.local. Port 24224 Retry_Limit 5 `

NeckBeardPrince commented 3 years ago

Same here on 1.8.8

NeckBeardPrince commented 3 years ago

Any update on this? It's happening in 1.8.8 non-debug as well.

NeckBeardPrince commented 3 years ago

@lmuhlha Have you found a workaround for this?

edsiper commented 3 years ago

if you have 2.6G of data up in memory and then you aim to convert it to JSON you will exceed 3GB for sure, your mem_buf_limits are too high

ggallagher0 commented 3 years ago

@edsiper our mem_buf_limits are 500mb and the OP's are 1mb. If this was just a configuration thing, it would be happening in both versions. When we rolled back to 1.5.2, memory use dropped right back to about 4mb per pod vs the 20mb-3gb that the 1.8.8 version pods used. In 1.8.8, one pod out of three would consistently run up to 3gb within hours while the others would slowly rise up and hang around at 20mb.

edsiper commented 3 years ago

@ggallagher0 can you try reproducing the problem by disabling systemd input ? can you help to isolate the plugin triggering the problem

NeckBeardPrince commented 3 years ago

@ggallagher0 can you try reproducing the problem by disabling systemd input ? can you help to isolate the plugin triggering the problem

I have this same issue and I only use the tail input.

[FILTER]
    Name              aws
    Match             *
    imds_version      v1
    az                true
    ec2_instance_id   true
    ec2_instance_type true
    private_ip        true
    ami_id            true
    account_id        true
    hostname          true
    vpc_id            true
[FILTER]
    Name                kubernetes
    Match               ingress-nginx.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     ingress-nginx.
    Use_Kubelet         true
    Buffer_Size         0
    Merge_Log           On
    Keep_Log            False
[SERVICE]
    Flush             5
    Grace             120
    Log_Level         error
    Daemon            off
    Parsers_File      parsers.conf
    HTTP_Server       On
    HTTP_Listen       0.0.0.0
    HTTP_Port         2020
    storage.metrics   On
    storage.path      /var/log/flb-storage/

@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE filter-aws.conf
@INCLUDE output-elasticsearch.conf
@INCLUDE output-s3.conf
[INPUT]
    Name              tail
    Alias             ingress_nginx_appdat-system
    Tag               ingress_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path              /var/log/containers/ingress-nginx-controller*.log
    Parser            docker
    DB                /var/log/flb_ingress.db
    storage.type      filesystem
    Docker_Mode       On
    Skip_Long_Lines   On
    Refresh_Interval  5
    Buffer_Max_Size   1MB
    Mem_Buf_Limt      5MB
[OUTPUT]
    Name                      es
    Match                     *
    Host                      ${ELASTICSEARCH_HOST}
    Port                      ${ELASTICSEARCH_PORT}
    AWS_Auth                  ${ELASTICSEARCH_AWS_AUTH}
    AWS_Region                ${ELASTICSEARCH_AWS_REGION}
    TLS                       On
    Generate_ID               On
    Logstash_Prefix           access-logs
    Logstash_Format           On
    Replace_Dots              On
    Buffer_Size               False
    Retry_Limit               False
    storage.total_limit_size  2048M
[OUTPUT]
    Name                          s3
    Match                         *
    bucket                        ${S3_BUCKET_NAME}
    region                        ${S3_BUCKET_REGION}
    store_dir                     /var/log/flb-storage
    s3_key_format                 ${S3_BUCKET_KEY_FORMAT}
    s3_key_format_tag_delimiters  .-
    upload_timeout                5m
    Retry_Limit                   False
    storage.total_limit_size      2048M
[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On

NeckBeardPrince commented 3 years ago

Any update?

gabegorelick commented 3 years ago

https://github.com/fluent/fluent-bit/issues/4192 may be a related issue.

NeckBeardPrince commented 2 years ago

Same issue with 1.8.9 non-debug.

leonardo-albertovich commented 2 years ago

I wonder which case is the easiest one to reproduce locally, @lmuhlha s seems to be good output wise because it's using the http plugin but it's a bit convoluted configuration wise, @ggallagher0 s is good because it uses simpler inputs and the output plugin is forwarder which means it can be locally set without requiring any api keys.

Have you tried removing those outputs and adding a simple tcp endpooint to see if the leak is still there @NeckBeardPrince?

I'm trying to come up with some ideas on what these cases have in common and what simplifications could be made to prove these ideas, the one thing 2 out of 3 have in common is the Kubernetes filter plugin and all of them use parsers.

lmuhlha commented 2 years ago

Just an update from my end, I've been trying to get the k8s filter to work with my set up but on 1.5.7 I can't seem to get it to connect properly: [ warn] [filter:kubernetes:kubernetes.0] could not get meta for POD ... If I add the configs to the filter to use Kubelet, FluentBit crashes because I assume the Kubelet features weren't supported in that version yet. If I use gcr.io/gke-on-prem-release/fluent-bit:v1.8.3-gke.3 (provided by Google from other disucssions), I am able to use Kubelet and connect properly, but we start to see several pods OOMing again.

Re: "I wonder which case is the easiest one to reproduce locally, @lmuhlha s seems to be good output wise because it's using the http plugin but it's a bit convoluted configuration wise," I can try to deploy a simplified config if that helps debug the issue.

lmuhlha commented 2 years ago

So I just tried this again with a simplified config and decreased the Mem_Buf_Limit and am still seeing the OOM on some pods. FluentBit version: gcr.io/gke-on-prem-release/fluent-bit:v1.8.3-gke.3 w/ Google's Exporter: gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0 Config:

 fluent-bit.conf: |-
    [SERVICE]
        Flush         5
        Grace         120
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_PORT     3020
    @INCLUDE containers.input.conf
    @INCLUDE filter.conf
    @INCLUDE output.conf
  containers.input.conf: |-
    [INPUT]
        Name             tail
        Alias            k8s_container
        Tag              k8s_container.<namespace_name>.<pod_name>.<container_name>
        Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
        Path             /var/log/containers/*.log
        Parser           docker
        DB               /var/run/google-fluentbit/pos-files/flb_kube.db
        Buffer_Max_Size  1MB
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 5
  filter.conf: |-
    [FILTER]
        Name                kubernetes
        Match               k8s_container.<namespace_name>.<pod_name>.<container_name>
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        Merge_Log           On
        Buffer_Size         0
        Use_Kubelet         true
        Kubelet_Port        10250
  output.conf: |-
    # Single output for all logs, project log routing handled by sinks in host project
    [OUTPUT]
        Name                       http
        Alias                      http-export-all
        Match                      *
        Host                       127.0.0.1
        Port                       3021
        URI                        /logs
        header_tag                 FLUENT-TAG
        Format                     msgpack
        Retry_Limit                2
  parsers.conf: |-
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep    On

Pod 1: Screen Shot 2021-11-18 at 6 12 29 PM Pod 2: Screen Shot 2021-11-18 at 6 14 14 PM

lmuhlha commented 2 years ago

Happening on 1.7.9 as well Screen Shot 2021-11-22 at 7 49 06 PM

NeckBeardPrince commented 2 years ago

Happening on 1.7.9 as well

Do you mean 1.8.9? After going back to 1.7.9 I'm no longer having the issue. But 1.8.9 is also having the same problem.

lmuhlha commented 2 years ago

Nope, I actually have the issue in 1.7.9 as well. So far anything I try above 1.5.7 does it, will continue trying things out.

lmuhlha commented 2 years ago

Just tried 1.7.7 with no issue.

triThirty commented 2 years ago

Same issue when I use 1.8.10. As you see the graph I posted, what makes me confused is container_memory_working_set_bytes{endpoint="https-metrics", id="/kubepods/pod2cfb2523-0d79-43f5-a2a0-db07e0029bdd", instance="10.34.7.89:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="logging", node="ip-10-34-7-89.ec2.internal", pod="fluent-bit-wr5wt", service="kube-prometheus-operator-k-kubelet"} the metrics has a peak.

namevic commented 2 years ago

After updating to 1.8.12 I don't see the memory leak.

KrishnaKant1509 commented 2 years ago

Maybe these two are the same issue: https://github.com/fluent/fluent-bit/issues/5147

KrishnaKant1509 commented 2 years ago

After updating to 1.8.12 I don't see the memory leak.

even with 1.8.12, I am facing some problem when turning K8S-Logging.Exclude On in kubernetes filter plugin. It remains constant when I turn this option Off.

danielserrao commented 2 years ago

Same issue here. Tested with versions 1.8.11 and 1.8.12 and with K8S-Logging.Exclude Off, but the memory always keeps leaking.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.

fluent / fluent-bit

Potential memory leak in v1.8.7 debug #4211

Bug Report