We experienced the same issue when upgrading from 1.5.2 to 1.8.8. One pod would consistently use up to 3GB of memory and then crash. Upping Flush to 8 in the service config helped, but pods are still using 3x more memory than they did in 1.5.2.

[SERVICE]
    Flush                     8
    Log_Level                 info
    Daemon                    Off
    Parsers_File              parsers.conf
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_Port                 2020
    storage.path              /tmp
    storage.sync              normal
    storage.backlog.mem_limit 100M
    storage.metrics           on

[INPUT]
    Name             tail
    Tag              kube.
    Path             /var/log/containers/xxx*.log
    Parser           docker
    DB               /tmp/flb_kube.xxx.db
    Mem_Buf_Limit    500MB
    Skip_Long_Lines  On
    Refresh_Interval 10
    storage.type     filesystem

[INPUT]
    Name             tail
    Tag              kube.
    Path             /var/log/containers/kube-system*.log
    Parser           docker
    DB               /tmp/flb_kube.kube-system.db
    Mem_Buf_Limit    500MB
    Skip_Long_Lines  On
    Refresh_Interval 10
    storage.type     filesystem

[INPUT]
    Name             tail
    Tag              kube.
    Path             /var/log/containers/cloudability*.log
    Parser           docker
    DB               /tmp/flb_kube.cloudability.db
    Mem_Buf_Limit    500MB
    Skip_Long_Lines  On
    Refresh_Interval 10
    storage.type     filesystem

[INPUT]
    Name              systemd
    Tag               nodes
    DB                /tmp/flb_systemd.db
    Mem_Buf_Limit     500MB
    Strip_Underscores On
    Skip_Long_Lines   On
    Refresh_Interval  10
    storage.type      filesystem
    Systemd_Filter    _SYSTEMD_UNIT=kubelet.service

[INPUT]
    Name             tail
    Tag              k8s-audit
    Path             /opt/rke/var/log/kube-audit/k8s-audit-log.json
    Parser           k8s-audit
    DB               /tmp/flb_k8s_audit.db
    Mem_Buf_Limit    500MB
    Skip_Long_Lines  On
    Refresh_Interval 10
    Rotate_Wait      10
    storage.type     filesystem

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_Tag_Prefix     kube.var.log.containers.
    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

[OUTPUT]
    Name        forward
    Match       *
    Host        fluentd-forward.xxx.svc.cluster.local.
    Port        24224
    Retry_Limit 5
Same here on 1.8.8
Any update on this? It's happening in 1.8.8 non-debug as well.
@lmuhlha Have you found a workaround for this?
If you have 2.6GB of data in memory and then aim to convert it to JSON, you will exceed 3GB for sure; your Mem_Buf_Limit values are too high.
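For reference, that figure roughly matches the config above: five inputs at 500MB Mem_Buf_Limit each is about 2.5GB of potential in-memory data before any JSON conversion overhead. A tighter per-input limit would look something like the sketch below; the 50MB value is illustrative only, not something tested in this thread.

[INPUT]
    Name             tail
    Tag              kube.
    Path             /var/log/containers/xxx*.log
    Parser           docker
    DB               /tmp/flb_kube.xxx.db
    # Illustrative cap, much lower than the 500MB used above; filesystem
    # buffering was already enabled in the original config.
    Mem_Buf_Limit    50MB
    storage.type     filesystem
    Skip_Long_Lines  On
    Refresh_Interval 10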
@edsiper our Mem_Buf_Limit values are 500MB and the OP's are 1MB. If this were just a configuration issue, it would be happening in both versions. When we rolled back to 1.5.2, memory use dropped right back to about 4MB per pod, versus the 20MB-3GB that the 1.8.8 pods used. In 1.8.8, one pod out of three would consistently run up to 3GB within hours while the others would slowly rise and hover around 20MB.
@ggallagher0 can you try reproducing the problem with the systemd input disabled? That would help isolate the plugin triggering the problem.
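To make that isolation concrete, a cut-down version of the config above that keeps only one tail input and the existing forward output might look like this; it is just a sketch of the suggested experiment, not a configuration anyone has confirmed here.

[SERVICE]
    Flush        8
    Log_Level    info
    Parsers_File parsers.conf
    storage.path /tmp

# Single tail input kept; the other tail, systemd and audit inputs are
# removed so any memory growth can be pinned on one plugin at a time.
[INPUT]
    Name             tail
    Tag              kube.
    Path             /var/log/containers/xxx*.log
    Parser           docker
    DB               /tmp/flb_kube.xxx.db
    Mem_Buf_Limit    500MB
    Skip_Long_Lines  On
    Refresh_Interval 10
    storage.type     filesystem

[OUTPUT]
    Name        forward
    Match       *
    Host        fluentd-forward.xxx.svc.cluster.local.
    Port        24224
    Retry_Limit 5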
I have this same issue and I only use the tail input.
[FILTER]
Name aws
Match *
imds_version v1
az true
ec2_instance_id true
ec2_instance_type true
private_ip true
ami_id true
account_id true
hostname true
vpc_id true
[FILTER]
Name kubernetes
Match ingress-nginx.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix ingress-nginx.
Use_Kubelet true
Buffer_Size 0
Merge_Log On
Keep_Log False
[SERVICE]
Flush 5
Grace 120
Log_Level error
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.metrics On
storage.path /var/log/flb-storage/
@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE filter-aws.conf
@INCLUDE output-elasticsearch.conf
@INCLUDE output-s3.conf
[INPUT]
Name tail
Alias ingress_nginx_appdat-system
Tag ingress_<namespace_name>_<pod_name>_<container_name>
Tag_Regex (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
Path /var/log/containers/ingress-nginx-controller*.log
Parser docker
DB /var/log/flb_ingress.db
storage.type filesystem
Docker_Mode On
Skip_Long_Lines On
Refresh_Interval 5
Buffer_Max_Size 1MB
Mem_Buf_Limit 5MB
[OUTPUT]
Name es
Match *
Host ${ELASTICSEARCH_HOST}
Port ${ELASTICSEARCH_PORT}
AWS_Auth ${ELASTICSEARCH_AWS_AUTH}
AWS_Region ${ELASTICSEARCH_AWS_REGION}
TLS On
Generate_ID On
Logstash_Prefix access-logs
Logstash_Format On
Replace_Dots On
Buffer_Size False
Retry_Limit False
storage.total_limit_size 2048M
[OUTPUT]
Name s3
Match *
bucket ${S3_BUCKET_NAME}
region ${S3_BUCKET_REGION}
store_dir /var/log/flb-storage
s3_key_format ${S3_BUCKET_KEY_FORMAT}
s3_key_format_tag_delimiters .-
upload_timeout 5m
Retry_Limit False
storage.total_limit_size 2048M
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
Any update?
https://github.com/fluent/fluent-bit/issues/4192 may be a related issue.
Same issue with 1.8.9 non-debug.
I wonder which case is the easiest one to reproduce locally. @lmuhlha's seems to be good output-wise because it's using the http plugin, but it's a bit convoluted configuration-wise; @ggallagher0's is good because it uses simpler inputs and the output plugin is forward, which means it can be set up locally without requiring any API keys.
Have you tried removing those outputs and adding a simple TCP endpoint to see if the leak is still there, @NeckBeardPrince?
I'm trying to come up with some ideas on what these cases have in common and what simplifications could be made to prove those ideas. The one thing two out of the three have in common is the Kubernetes filter plugin, and all of them use parsers.
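On the TCP endpoint idea above, a drop-in replacement for the es/s3 outputs could be as small as the following; the local listener on port 5170 is an assumption for illustration (anything like nc -lk 5170 would do), not a setup reported in this thread.

[OUTPUT]
    # Ships all records to a local TCP listener, so no AWS credentials or
    # Elasticsearch endpoint are needed while chasing the leak.
    Name  tcp
    Match *
    Host  127.0.0.1
    Port  5170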
Just an update from my end: I've been trying to get the k8s filter to work with my setup on 1.5.7, but I can't seem to get it to connect properly: [ warn] [filter:kubernetes:kubernetes.0] could not get meta for POD ...
If I add the Kubelet options to the filter, Fluent Bit crashes; I assume the Kubelet features weren't supported in that version yet.
If I use gcr.io/gke-on-prem-release/fluent-bit:v1.8.3-gke.3 (provided by Google in other discussions), I am able to use Kubelet and connect properly, but we start to see several pods OOMing again.
Re: "I wonder which case is the easiest one to reproduce locally, @lmuhlha s seems to be good output wise because it's using the http plugin but it's a bit convoluted configuration wise," I can try to deploy a simplified config if that helps debug the issue.
So I just tried this again with a simplified config and a decreased Mem_Buf_Limit, and I am still seeing the OOM on some pods.
FluentBit version: gcr.io/gke-on-prem-release/fluent-bit:v1.8.3-gke.3
w/ Google's Exporter: gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0
Config:
fluent-bit.conf: |-
[SERVICE]
Flush 5
Grace 120
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 3020
@INCLUDE containers.input.conf
@INCLUDE filter.conf
@INCLUDE output.conf
containers.input.conf: |-
[INPUT]
Name tail
Alias k8s_container
Tag k8s_container.<namespace_name>.<pod_name>.<container_name>
Tag_Regex (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
Path /var/log/containers/*.log
Parser docker
DB /var/run/google-fluentbit/pos-files/flb_kube.db
Buffer_Max_Size 1MB
Mem_Buf_Limit 1MB
Skip_Long_Lines On
Refresh_Interval 5
filter.conf: |-
[FILTER]
Name kubernetes
Match k8s_container.<namespace_name>.<pod_name>.<container_name>
Kube_URL https://kubernetes.default.svc.cluster.local:443
Merge_Log On
Buffer_Size 0
Use_Kubelet true
Kubelet_Port 10250
output.conf: |-
# Single output for all logs, project log routing handled by sinks in host project
[OUTPUT]
Name http
Alias http-export-all
Match *
Host 127.0.0.1
Port 3021
URI /logs
header_tag FLUENT-TAG
Format msgpack
Retry_Limit 2
parsers.conf: |-
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
(memory usage graphs attached for Pod 1 and Pod 2)
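If it would help to see whether the growth sits in input chunks or elsewhere, one possible addition to the simplified SERVICE section (an assumption on my part, not something tested above) is to enable the built-in storage metrics:

[SERVICE]
    Flush           5
    Grace           120
    Log_Level       info
    Parsers_File    parsers.conf
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_Port       3020
    # Illustrative additions: expose chunk/storage counters through the
    # built-in HTTP server and spill chunks to disk under an assumed path.
    storage.metrics On
    storage.path    /var/log/flb-storage/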
Happening on 1.7.9 as well
Re: "Happening on 1.7.9 as well." Do you mean 1.8.9? After going back to 1.7.9 I'm no longer having the issue, but 1.8.9 does have the same problem.
Nope, I actually have the issue in 1.7.9 as well. So far anything I try above 1.5.7 does it; I will continue trying things out.
Just tried 1.7.7 with no issue.
Same issue when I use 1.8.10. As you can see in the graph I posted, what confuses me is that container_memory_working_set_bytes{endpoint="https-metrics", id="/kubepods/pod2cfb2523-0d79-43f5-a2a0-db07e0029bdd", instance="10.34.7.89:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="logging", node="ip-10-34-7-89.ec2.internal", pod="fluent-bit-wr5wt", service="kube-prometheus-operator-k-kubelet"} has a peak.
After updating to 1.8.12 I don't see the memory leak.
Maybe these two are the same issue: https://github.com/fluent/fluent-bit/issues/5147
Re: "After updating to 1.8.12 I don't see the memory leak." Even with 1.8.12, I am still seeing the problem when I turn K8S-Logging.Exclude On in the kubernetes filter plugin; memory remains constant when I turn that option Off.
Same issue here. Tested with versions 1.8.11 and 1.8.12 and with K8S-Logging.Exclude Off, but the memory always keeps leaking.
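For anyone comparing those two reports, the option being toggled is K8S-Logging.Exclude on the kubernetes filter; a minimal sketch of the stanza in question (the Match pattern is illustrative):

[FILTER]
    Name                kubernetes
    Match               kube.*
    Merge_Log           On
    # One report above sees the leak only with this On; the other still
    # sees it with the option Off.
    K8S-Logging.Exclude On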
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Bug Report
Describe the bug: fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f constantly increases its memory usage and eventually gets OOM-killed.
Expected behavior: Deploying fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f with a specified amount of memory will work and not constantly increase memory usage / OOM.