We had a scenario where a lot of pods (started from a cron) were stuck in Pending state, around 9k. I know it was an internal problem on our side, but because of it we noticed that fluentd started to crash (OOM-killed).
While trying to understand the scenario and removing parts of our fluentd configuration, we noticed that the kubernetes_metadata filter was the problem, apparently because we had too many pods.
How do you know it was the fluent-plugin-kubernetes_metadata_filter which was the problem? Was the OOM kill stacktrace in this plugin code?
It would be good if the plugin kept its memory consumption low.
Have you tried adjusting cache_size and cache_ttl?
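For reference, a minimal sketch of how these options can be set on the filter block; the tag pattern and the values below are only illustrative, not recommendations:

<filter kubernetes.**>
  @type kubernetes_metadata
  # maximum number of pod metadata entries kept in the LRU cache (default 1000)
  cache_size 100
  # how long, in seconds, a cached entry is kept before it expires
  cache_ttl 3600
</filter>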
I started removing pieces from my fluentd.conf one by one.
Basically my config is:
<source> for apps
<source> for kube-system
<source> for one file not managed by kubernetes
<filter> kubernetes_metadata for apps and kube-system
<match> to send to CloudWatch Logs
Removing the sources, it kept crashing; then I removed just the filter and it stopped crashing.
I didn't try the cache_size option.
My oom-killer logs:
Apr 24 11:51:07 ip-10-0-74-72 kernel: filter_kuberne* invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=975
Apr 24 11:51:07 ip-10-0-74-72 kernel: filter_kuberne* cpuset=4dd475111b9c266dadd3132047c2baba4f7afe2ec9e7d895d0efe76f9806d0cd mems_allowed=0
Apr 24 11:51:07 ip-10-0-74-72 kernel: CPU: 1 PID: 28750 Comm: filter_kuberne* Not tainted 4.14.97-90.72.amzn2.x86_64 #1
Apr 24 11:51:07 ip-10-0-74-72 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
Apr 24 11:51:07 ip-10-0-74-72 kernel: Call Trace:
Apr 24 11:51:07 ip-10-0-74-72 kernel: dump_stack+0x5c/0x82
Apr 24 11:51:07 ip-10-0-74-72 kernel: dump_header+0x94/0x229
Apr 24 11:51:07 ip-10-0-74-72 kernel: oom_kill_process+0x213/0x410
Apr 24 11:51:07 ip-10-0-74-72 kernel: out_of_memory+0x2af/0x4d0
Apr 24 11:51:07 ip-10-0-74-72 kernel: mem_cgroup_out_of_memory+0x49/0x80
Apr 24 11:51:07 ip-10-0-74-72 kernel: mem_cgroup_oom_synchronize+0x2ed/0x330
Apr 24 11:51:07 ip-10-0-74-72 kernel: ? mem_cgroup_css_online+0x30/0x30
Apr 24 11:51:07 ip-10-0-74-72 kernel: pagefault_out_of_memory+0x32/0x77
Apr 24 11:51:07 ip-10-0-74-72 kernel: __do_page_fault+0x4b4/0x4c0
Apr 24 11:51:07 ip-10-0-74-72 kernel: ? page_fault+0x2f/0x50
Apr 24 11:51:07 ip-10-0-74-72 kernel: page_fault+0x45/0x50
Apr 24 11:51:07 ip-10-0-74-72 kernel: RIP: 4000:0xffffffffffffffff
Apr 24 11:51:07 ip-10-0-74-72 kernel: RSP: 1700000:00007f268a5fde78 EFLAGS: 7f2688dd9000
Apr 24 11:51:07 ip-10-0-74-72 kernel: Task in /kubepods/burstable/podc43ee131-6686-11e9-8e21-06eeaf1192dc/4dd475111b9c266dadd3132047c2baba4f7afe2ec9e7d895d0efe76f9806d0cd killed as a result of limit of /kubepods/burstable/podc43ee131-6686-11e9-8e21-06eeaf1192dc
Apr 24 11:51:07 ip-10-0-74-72 kernel: memory: usage 524288kB, limit 524288kB, failcnt 1864
Apr 24 11:51:07 ip-10-0-74-72 kernel: memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
Apr 24 11:51:07 ip-10-0-74-72 kernel: kmem: usage 5152kB, limit 9007199254740988kB, failcnt 0
Apr 24 11:51:07 ip-10-0-74-72 kernel: Memory cgroup stats for /kubepods/burstable/podc43ee131-6686-11e9-8e21-06eeaf1192dc: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 24 11:51:07 ip-10-0-74-72 kernel: Memory cgroup stats for /kubepods/burstable/podc43ee131-6686-11e9-8e21-06eeaf1192dc/3ac28d0df3c378e3218f0e2df1a3993aacb2d229b0b7f540a48f4f87d64eed61: cache:0KB rss:44KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:44KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 24 11:51:07 ip-10-0-74-72 kernel: Memory cgroup stats for /kubepods/burstable/podc43ee131-6686-11e9-8e21-06eeaf1192dc/4dd475111b9c266dadd3132047c2baba4f7afe2ec9e7d895d0efe76f9806d0cd: cache:0KB rss:519092KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:519092KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 24 11:51:07 ip-10-0-74-72 kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Apr 24 11:51:07 ip-10-0-74-72 kernel: [26696] 0 26696 256 1 4 2 0 -998 pause
Apr 24 11:51:07 ip-10-0-74-72 kernel: [26957] 0 26957 4170 312 11 3 0 975 tini
Apr 24 11:51:07 ip-10-0-74-72 kernel: [26985] 0 26985 154525 111157 296 4 0 975 ruby2.3
Apr 24 11:51:07 ip-10-0-74-72 kernel: [27107] 0 27107 6687 532 18 3 0 975 bash
Apr 24 11:51:07 ip-10-0-74-72 kernel: [27610] 0 27610 64586 22055 130 3 0 975 ruby2.3
Apr 24 11:51:07 ip-10-0-74-72 kernel: Memory cgroup out of memory: Kill process 26985 (ruby2.3) score 1824 or sacrifice child
Apr 24 11:51:07 ip-10-0-74-72 kernel: Killed process 27610 (ruby2.3) total-vm:258344kB, anon-rss:80628kB, file-rss:7592kB, shmem-rss:0kB
Apr 24 11:51:10 ip-10-0-74-72 kernel: oom_reaper: reaped process 27610 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
I'm closing this issue. Please open a new issue if cache_size and cache_ttl do not solve the OOM problem.
Ok, tried with cache_size and cache_ttl. Same error.
I created the 5k pods with pending-pod.yaml:
# kubectl create ns pending
# kubectl apply -f pending-pod.yaml
# wait until kubectl -n pending get pods | wc -l reports ~5k pods
apiVersion: apps/v1
kind: Deployment
metadata:
name: pending
namespace: pending
spec:
replicas: 5000
selector:
matchLabels:
app: pending
template:
metadata:
labels:
app: pending
spec:
containers:
- name: pending
image: fluent/fluentd-kubernetes-daemonset:v1.3-debian-cloudwatch-1
volumeMounts:
- name: vol
mountPath: /invalid
subPath: invalid
volumes:
- name: vol
persistentVolumeClaim:
claimName: invalid-pvc
And waited until all of them were created (they will be in Pending state, but what matters here is the quantity of pods).
Then I created a pod with fluentd and only the kubernetes_metadata filter, using fluentd-pod.yaml:
# kubectl create ns fluentd
# kubectl apply -f fluentd-pod.yaml
# it will take some time to start
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd
namespace: fluentd
data:
fluent.conf: |
<match fluent.**>
@type null
</match>
<filter apps.**>
@type kubernetes_metadata
cache_size 10
#cache_ttl 60
</filter>
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: fluentd
namespace: fluentd
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: fluentd
namespace: fluentd
rules:
- apiGroups: [""]
resources: ["namespaces", "pods"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: fluentd
namespace: fluentd
subjects:
- kind: ServiceAccount
name: fluentd
namespace: fluentd
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: fluentd
---
apiVersion: v1
kind: Pod
metadata:
name: fluentd
namespace: fluentd
spec:
serviceAccountName: fluentd
initContainers:
- name: copy-fluentd-config
image: busybox
command: ['sh', '-c', 'cp /config-volume/* /etc/fluentd']
volumeMounts:
- mountPath: /config-volume
name: config-volume
- mountPath: /etc/fluentd
name: config
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1.3-debian-cloudwatch-1
imagePullPolicy: Always
env:
- name: AWS_REGION
value: eu-central-1
- name: LOG_GROUP_NAME
value: kubernetes
- name: FLUENT_UID
value: "0"
- name: RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR
value: "0.8"
- name: K8S_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: K8S_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: K8S_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
resources:
limits:
cpu: 100m
memory: 300Mi
volumeMounts:
- name: config
mountPath: /fluentd/etc
terminationGracePeriodSeconds: 30
volumes:
- name: config
emptyDir: {}
- name: config-volume
configMap:
name: fluentd
Because of the quantity of pods the cluster becomes a little slow; fluentd takes around 2 minutes to start and crashes after about 2 minutes of running.
After the restart you can see it in the pod's describe output.
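For reference, the command I mean, using the namespace and pod name from the manifest above:

kubectl -n fluentd describe pod fluentd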
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 24 Apr 2019 23:22:38 +0200
Finished: Wed, 24 Apr 2019 23:23:51 +0200
and in the logs (in my case /var/log/messages):
Apr 24 21:23:46 ip-172-20-4-124 kernel: filter_kuberne* invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-998
Apr 24 21:23:46 ip-172-20-4-124 kernel: filter_kuberne* cpuset=7d1ef59f779cc7ac6c7de2926e06b081c09765403d827b56f26f43a58364fa4a mems_allowed=0
Apr 24 21:23:46 ip-172-20-4-124 kernel: CPU: 0 PID: 24366 Comm: filter_kuberne* Not tainted 4.14.97-90.72.amzn2.x86_64 #1
Apr 24 21:23:46 ip-172-20-4-124 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
Apr 24 21:23:46 ip-172-20-4-124 kernel: Call Trace:
Apr 24 21:23:46 ip-172-20-4-124 kernel: dump_stack+0x5c/0x82
Apr 24 21:23:46 ip-172-20-4-124 kernel: dump_header+0x94/0x229
Apr 24 21:23:46 ip-172-20-4-124 kernel: oom_kill_process+0x213/0x410
Apr 24 21:23:46 ip-172-20-4-124 kernel: out_of_memory+0x2af/0x4d0
Apr 24 21:23:46 ip-172-20-4-124 kernel: mem_cgroup_out_of_memory+0x49/0x80
Apr 24 21:23:46 ip-172-20-4-124 kernel: mem_cgroup_oom_synchronize+0x2ed/0x330
Apr 24 21:23:46 ip-172-20-4-124 kernel: ? mem_cgroup_css_online+0x30/0x30
Apr 24 21:23:46 ip-172-20-4-124 kernel: pagefault_out_of_memory+0x32/0x77
Apr 24 21:23:46 ip-172-20-4-124 kernel: __do_page_fault+0x4b4/0x4c0
Apr 24 21:23:46 ip-172-20-4-124 kernel: ? page_fault+0x2f/0x50
Apr 24 21:23:46 ip-172-20-4-124 kernel: page_fault+0x45/0x50
Apr 24 21:23:46 ip-172-20-4-124 kernel: RIP: af416800:0x7f7e9c380fc0
Apr 24 21:23:46 ip-172-20-4-124 kernel: RSP: 0120:00007f7ea5806000 EFLAGS: 000000a0
Apr 24 21:23:46 ip-172-20-4-124 kernel: Task in /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72/c8bb0194b765371387720dce02218838b80c9c9320bf9a1ab6812d8ef209e17f killed as a result of limit of /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72
Apr 24 21:23:46 ip-172-20-4-124 kernel: memory: usage 307200kB, limit 307200kB, failcnt 96
Apr 24 21:23:46 ip-172-20-4-124 kernel: memory+swap: usage 307200kB, limit 9007199254740988kB, failcnt 0
Apr 24 21:23:46 ip-172-20-4-124 kernel: kmem: usage 3800kB, limit 9007199254740988kB, failcnt 0
Apr 24 21:23:46 ip-172-20-4-124 kernel: Memory cgroup stats for /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 24 21:23:46 ip-172-20-4-124 kernel: Memory cgroup stats for /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72/7d1ef59f779cc7ac6c7de2926e06b081c09765403d827b56f26f43a58364fa4a: cache:0KB rss:303356KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:303356KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 24 21:23:46 ip-172-20-4-124 kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Apr 24 21:23:46 ip-172-20-4-124 kernel: [24043] 0 24043 256 1 4 2 0 -998 pause
Apr 24 21:23:46 ip-172-20-4-124 kernel: [24285] 0 24285 4170 349 11 3 0 -998 tini
Apr 24 21:23:46 ip-172-20-4-124 kernel: [24308] 0 24308 79101 42881 151 3 0 -998 ruby2.3
Apr 24 21:23:46 ip-172-20-4-124 kernel: [24372] 0 24372 71192 36727 139 4 0 -998 ruby2.3
Apr 24 21:23:46 ip-172-20-4-124 kernel: Memory cgroup out of memory: Kill process 24043 (pause) score 0 or sacrifice child
Apr 24 21:23:46 ip-172-20-4-124 kernel: Killed process 24043 (pause) total-vm:1024kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
Apr 24 21:23:48 ip-172-20-4-124 kernel: oom_reaper: reaped process 24043 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Apr 24 21:23:49 ip-172-20-4-124 kernel: filter_kuberne* invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-998
Apr 24 21:23:49 ip-172-20-4-124 kernel: filter_kuberne* cpuset=7d1ef59f779cc7ac6c7de2926e06b081c09765403d827b56f26f43a58364fa4a mems_allowed=0
Apr 24 21:23:49 ip-172-20-4-124 kernel: CPU: 0 PID: 24366 Comm: filter_kuberne* Not tainted 4.14.97-90.72.amzn2.x86_64 #1
Apr 24 21:23:49 ip-172-20-4-124 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
Apr 24 21:23:49 ip-172-20-4-124 kernel: Call Trace:
Apr 24 21:23:49 ip-172-20-4-124 kernel: dump_stack+0x5c/0x82
Apr 24 21:23:49 ip-172-20-4-124 kernel: dump_header+0x94/0x229
Apr 24 21:23:49 ip-172-20-4-124 kernel: oom_kill_process+0x213/0x410
Apr 24 21:23:49 ip-172-20-4-124 kernel: out_of_memory+0x2af/0x4d0
Apr 24 21:23:49 ip-172-20-4-124 kernel: mem_cgroup_out_of_memory+0x49/0x80
Apr 24 21:23:49 ip-172-20-4-124 kernel: mem_cgroup_oom_synchronize+0x2ed/0x330
Apr 24 21:23:49 ip-172-20-4-124 kernel: ? mem_cgroup_css_online+0x30/0x30
Apr 24 21:23:49 ip-172-20-4-124 kernel: pagefault_out_of_memory+0x32/0x77
Apr 24 21:23:49 ip-172-20-4-124 kernel: __do_page_fault+0x4b4/0x4c0
Apr 24 21:23:49 ip-172-20-4-124 kernel: ? page_fault+0x2f/0x50
Apr 24 21:23:49 ip-172-20-4-124 kernel: page_fault+0x45/0x50
Apr 24 21:23:49 ip-172-20-4-124 kernel: RIP: af416800:0x7f7e9c38afc0
Apr 24 21:23:49 ip-172-20-4-124 kernel: RSP: 0120:00007f7ea5806000 EFLAGS: 000000a0
Apr 24 21:23:49 ip-172-20-4-124 kernel: Task in /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72/7d1ef59f779cc7ac6c7de2926e06b081c09765403d827b56f26f43a58364fa4a killed as a result of limit of /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72
Apr 24 21:23:49 ip-172-20-4-124 kernel: memory: usage 307200kB, limit 307200kB, failcnt 132
Apr 24 21:23:49 ip-172-20-4-124 kernel: memory+swap: usage 307200kB, limit 9007199254740988kB, failcnt 0
Apr 24 21:23:49 ip-172-20-4-124 kernel: kmem: usage 3768kB, limit 9007199254740988kB, failcnt 0
Apr 24 21:23:49 ip-172-20-4-124 kernel: Memory cgroup stats for /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 24 21:23:49 ip-172-20-4-124 kernel: Memory cgroup stats for /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72/c8bb0194b765371387720dce02218838b80c9c9320bf9a1ab6812d8ef209e17f: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 24 21:23:49 ip-172-20-4-124 kernel: Memory cgroup stats for /kubepods/pod709a51df-66d6-11e9-8111-027eee44ea72/7d1ef59f779cc7ac6c7de2926e06b081c09765403d827b56f26f43a58364fa4a: cache:0KB rss:303432KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:303432KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 24 21:23:49 ip-172-20-4-124 kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Apr 24 21:23:49 ip-172-20-4-124 kernel: [24285] 0 24285 4170 349 11 3 0 -998 tini
Apr 24 21:23:49 ip-172-20-4-124 kernel: [24308] 0 24308 79101 42881 151 3 0 -998 ruby2.3
Apr 24 21:23:49 ip-172-20-4-124 kernel: [24372] 0 24372 71192 36727 139 4 0 -998 ruby2.3
Apr 24 21:23:49 ip-172-20-4-124 kernel: Memory cgroup out of memory: Kill process 24285 (tini) score 0 or sacrifice child
Apr 24 21:23:49 ip-172-20-4-124 kernel: Killed process 24308 (ruby2.3) total-vm:316404kB, anon-rss:163352kB, file-rss:8172kB, shmem-rss:0kB
Hmm - then maybe it has nothing to do with the LRU cache - even a setting of 1000 (the default) should not cause an OOM in this case. So I'm not sure what the problem is. Can you afford to increase your fluentd memory? If so, can you eventually increase the fluentd memory to the point where you do not get an OOM? Also, are you using jemalloc with fluentd?
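A quick way to check the jemalloc question, using the pod and namespace names from the repro manifest above (the daemonset images normally enable jemalloc via the LD_PRELOAD environment variable, so an empty result suggests it is not preloaded):

kubectl -n fluentd exec fluentd -- sh -c 'echo $LD_PRELOAD'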
My guess is it happens when the plugin queries the kube API to load the pods.
But first of all, my scenario of 5k pods only happened by mistake in a dev environment, where a cron started too many pods in Pending; this won't happen in my real scenario.
I opened the issue as an improvement, to try to keep the memory usage as low as possible. We have noticed that fluentd in general consumes a lot of memory and we are trying to understand which parts of it do, and why.
Improvements to the kubernetes_metadata filter would be very welcome for the whole stack :-)
@janario I think you have to ask yourself what is reasonable for fluentd given what you are seeing. You are restricting the entire process to 300M. This memory needs to account for everything used by the Ruby runtime, fluentd's pipelines, in-memory caches, processing, etc. Additionally, this plugin adds caching of the labels (and annotations if configured) from every pod spec. I would imagine with a simple back-of-the-napkin calculation you could easily justify the metadata cache alone eating all of the 300M. If you desire the metadata, then you will need to budget for it in the collector.
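As a rough sketch of that back-of-the-napkin calculation, with assumed (not measured) figures; only the pod count comes from this thread:

# rough estimate of the memory needed just to hold parsed pod objects if the
# plugin ends up loading every pod in the cluster (both figures are assumptions)
pods          = 9_000     # roughly the number of Pending pods in the incident
bytes_per_pod = 20_000    # assumed ~20 KB of parsed metadata per pod object
mb = pods * bytes_per_pod / (1024.0 * 1024.0)
puts format('~%.0f MB before Ruby object and GC overhead', mb)   # => ~172 MB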
Feel free to submit any PRs to improve the caching mechanism's memory usage
I agree 300M maybe is not the best value,
but when you consider the cluster has a lot of pods, not all of them run on the same worker node, so not all of them should have to be handled by the fluentd running on that node (daemonset).
The problem I see here is that I have to increase fluentd's memory according to my whole cluster size, not the pods on my worker node.
I agree 300M maybe is not the best value
The cache is LRU. You might consider actually doing the opposite of what @richm suggested and lowering the number of entries since you are trying to limit the memory usage. The consequences here, however, are that you will be placing more load on the API server.
but when you consider the cluster has a lot of pods, not all of them run on the same worker node, so not all of them should have to be handled by the fluentd running on that node (daemonset)
Fluentd does not manage any of the pods; the runtime is responsible for that management. Additionally, the runtime spreads pods across the cluster, and fluentd, as noted, is a daemonset, which means it only needs to cache metadata for the pods which are scheduled on its node, not for the entire cluster. The LRU cache will evict metadata for pods which fall to the "bottom" of the cache. I believe lowering the max entry count or the TTL may help with the issue you are seeing.
One other place which may warrant checking is the stats cache. It keeps some counts, which in the grand scheme of the issue probably isn't much, but I do not recall what measures are taken to evict that cache.
The problem I see here is that I have to increase fluentd's memory according to my whole cluster size, not the pods on my worker node
This is why the platform gives you the option to require (min) memory and restrict (or not) memory. What you are saying is exactly the reason you would not restrict memory; as the workload increases, allow the platform to give fluentd more memory.
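A sketch of what that looks like in the container spec; the request value is only illustrative:

resources:
  requests:
    memory: 300Mi
  # intentionally no limits.memory, so the collector can grow with the workload
  # instead of being OOM-killed at a fixed ceiling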
Let me reiterate that this feature comes at a price: memory. I fully understand it is undesirable to expect an infra component to eat up the precious memory that would otherwise be available to your other workloads. If you want this metadata you either need to:
Sorry for the delay.
Sorry if I was not clear; what I meant is: fluentd (filter) resources should not have to grow with the cluster size, only with what the worker node runs.
I've opened a PR which maybe makes it clearer: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/177
It seems that get_pods(limit: 1) is not limiting; I would like to ask for someone's help with this :)
(I'm not that familiar with Ruby.)
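Roughly what I'm aiming for, as a sketch only (it assumes kubeclient accepts the field_selector/limit options on get_pods, and it is not the plugin's current code):

require 'kubeclient'

# Sketch: scope the pod list to this node so memory scales with the node's
# pods rather than with the whole cluster.
client = Kubeclient::Client.new(
  'https://kubernetes.default.svc/api', 'v1',
  auth_options: { bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token' },
  ssl_options:  { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' }
)
node_name = ENV['K8S_NODE_NAME']  # exposed via the Downward API in the pod spec above
pods = client.get_pods(field_selector: "spec.nodeName=#{node_name}", limit: 500)
puts "pods scheduled on #{node_name}: #{pods.size}"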
I get it now - https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/177 explains it
Closing, fixed by #189.