Here are notes extracted from https://github.com/2i2c-org/infrastructure/issues/2098#issuecomment-1412989796 from when I tried to investigate this but didn't complete the investigation. Note that Yuvi also provided comments below that extracted comment.
I used the command cat /dev/zero | head -c 10G | tail to either have the process in the container killed, or the container itself killed by a pod eviction.
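For context on why this exhausts memory: /dev/zero produces no newlines, so tail has to buffer the entire 10 GiB stream while looking for line breaks, and its memory use grows until the container's limit (or the node's memory) runs out. A minimal way to reproduce and watch the fallout, assuming the staging and support namespaces seen below:

# In the user server's terminal: force ~10 GiB of memory use in a single process.
cat /dev/zero | head -c 10G | tail

# In separate terminals: watch how the cluster reacts.
kubectl -n staging get events --watch
kubectl -n support get pods --watch    # do node-exporter and cryptnono survive the memory pressure?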
I've failed to see the OOMKills dashboard populated with data: when the user pod gets evicted/OOMKilled, which I figure should register there, the support-prometheus-node-exporter pod that should observe it is itself getting evicted from the node, which is low on memory. Doh!
staging 104s Normal Created pod/jupyter-erik-402i2c-2eorg Created container notebook
staging 104s Normal Started pod/jupyter-erik-402i2c-2eorg Started container notebook
staging 104s Normal Pulled pod/jupyter-erik-402i2c-2eorg Container image "quay.io/2i2c/2i2c-hubs-image:69b1f9dff7c7" already present on machine
support 5s Warning Failed pod/support-cryptnono-58v2k Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
support 4s Warning BackOff pod/support-cryptnono-58v2k Back-off restarting failed container
support 3s Warning BackOff pod/support-prometheus-node-exporter-nxcqs Back-off restarting failed container
support 2s Warning Evicted pod/support-prometheus-node-exporter-nxcqs The node was low on resource: memory.
support 0s Normal SuccessfulDelete daemonset/support-prometheus-node-exporter Deleted pod: support-prometheus-node-exporter-nxcqs
support 0s Warning FailedDaemonPod daemonset/support-prometheus-node-exporter Found failed daemon pod support/support-prometheus-node-exporter-nxcqs on node ip-192-168-29-112.ca-central-1.compute.internal, will try to kill it
So I figure we need some memory requests for the node-exporter part of the support chart, and also for cryptnono.
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system aws-node-d5z62 25m (1%) 0 (0%) 0 (0%) 0 (0%) 14d
kube-system ebs-csi-controller-58b9ff5786-gz62k 60m (3%) 600m (31%) 240Mi (3%) 1536Mi (21%) 14d
kube-system ebs-csi-node-46jfr 30m (1%) 300m (15%) 120Mi (1%) 768Mi (10%) 14d
kube-system kube-proxy-slhsk 100m (5%) 0 (0%) 0 (0%) 0 (0%) 14d
staging jupyter-erik-402i2c-2eorg 50m (2%) 0 (0%) 6979321856 (94%) 8589934592 (115%) 11m
support support-cryptnono-58v2k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29m
support support-prometheus-node-exporter-d5cgq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9m34s
It seems that cryptnono requires little to no CPU, but about 60Mi of memory. I'll go ahead and trial cryptnono with a memory request of 60Mi and a limit of 120Mi, and node-exporter with a request of 20Mi and a limit of 40Mi (sketched as Helm values below, after the usage figures).
support-cryptnono-bm9fj 1m 60Mi
support-cryptnono-lgc5g 1m 60Mi
support-cryptnono-pqt9h 1m 61Mi
support-cryptnono-rfpxl 1m 60Mi
support-cryptnono-rnlzc 1m 60Mi
support-prometheus-node-exporter-fvvtr 2m 20Mi
support-prometheus-node-exporter-j97mg 1m 12Mi
support-prometheus-node-exporter-lz2fs 5m 18Mi
support-prometheus-node-exporter-m8gl6 1m 11Mi
support-prometheus-node-exporter-s4hdd 1m 20Mi
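A sketch of what that first trial could look like as Helm values for the support chart; the exact keys are an assumption (both the prometheus-node-exporter and cryptnono sub-charts are assumed to accept a standard resources block, and the nesting may differ depending on how the support chart pulls them in):

prometheus-node-exporter:
  resources:
    requests:
      memory: 20Mi
    limits:
      memory: 40Mi

cryptnono:
  resources:
    requests:
      memory: 60Mi
    limits:
      memory: 120Mi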
Having tested these limits, I observed issues: cryptnono struggled to start up, and prometheus-node-exporter crashed under memory pressure given the constraints. Now I'm also testing specifying CPU requests of 10m and limits of 100m for both cryptnono and node-exporter, and increasing memory for cryptnono to a 64Mi request and a 256Mi limit, and for node-exporter to a 32Mi request and a 64Mi limit.
I ended up increasing node-exporter further, to a 64Mi request and a 64Mi limit.
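To confirm that the new requests and limits actually land on the running pods, the effective values can be read back from the API (the only assumption here is the support namespace):

kubectl -n support get pods -o custom-columns='NAME:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'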
Following this, I can run the following in my user pod:
jovyan@jupyter-erik-402i2c-2eorg:~$ cat /dev/zero | head -c 10G | tail
Killed
But this doesn't show up in the OOMKills dashboard. I'm not confident about what should and shouldn't show up there. The entire user container wasn't killed after all, just the process within it, so there is no mention of an eviction either.
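A sketch of how to tell the two cases apart after the fact; the pod name is taken from the output above, and node access (e.g. via SSH) is only needed for the kernel-log check:

# If the whole container was OOMKilled, kubelet records it in the container status:
kubectl -n staging get pod jupyter-erik-402i2c-2eorg -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState}'

# If only a process inside the container was killed, the container status stays clean,
# but the kernel log on the node still records the kill:
dmesg -T | grep -i 'out of memory'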
I now managed to get my user pod evicted as well, without the node-exporter crashing. But that didn't show up either.
staging 55s Warning Evicted pod/jupyter-erik-402i2c-2eorg The node was low on resource: memory.
staging 55s Normal Killing pod/jupyter-erik-402i2c-2eorg Stopping container notebook
I think perhaps the node-exporter data is decoupled from this dashboard, and from the node CPU % etc.? In this hub the node % works, and kubectl top node also works (kubectl top is served by the metrics API / metrics-server rather than by node-exporter).
Resolved by fixing the selector
of a k8s Service for prometheus-node-exporter, see https://github.com/2i2c-org/infrastructure/issues/2191#issuecomment-1434907786
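For reference, a selector mismatch like that can be spotted by comparing the Service's selector against the pods' labels, or simply by checking whether the Service has any endpoints at all; the Service name is assumed here to match the daemonset name seen above:

# What does the Service select on?
kubectl -n support get svc support-prometheus-node-exporter -o jsonpath='{.spec.selector}'

# What labels do the node-exporter pods actually carry?
kubectl -n support get pods --show-labels | grep node-exporter

# An empty ENDPOINTS column means the selector matches no pods, so Prometheus has nothing to scrape.
kubectl -n support get endpoints support-prometheus-node-exporter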
I very much appreciate your thoroughness erik!
Tracking out of memory issues
Action points
Background understanding
If something is being OOMKilled, it can happen in two ways: the kernel's OOM killer can kill a process inside the container while the container keeps running, or it can kill the container's main process so that the whole container is terminated.
I'm not sure what is monitored by this dashboard: full container kills, in-container process kills, or both?
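One hedged note on that question: if the dashboard is driven by node-exporter's node_vmstat_oom_kill counter (an assumption, not verified here), it would cover both cases, since both an in-container process kill and a full container kill go through the kernel's OOM killer and increment the same counter; kubelet evictions, by contrast, are not kernel OOM kills and would need to be tracked separately. The counter can be queried directly from Prometheus, assuming the usual support-prometheus-server Service listening on port 80:

# Did any kernel OOM kills happen in the last hour, per node?
kubectl -n support port-forward svc/support-prometheus-server 9090:80 &
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=increase(node_vmstat_oom_kill[1h])'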