2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

grafana: ensure functionality of OOMKiller dashboard #2213

Closed · consideRatio closed this issue 1 year ago

consideRatio commented 1 year ago

Tracking out of memory issues

Action points

Background understanding

If something is being OOMKilled, it can happen in two ways: the whole container can be killed (or the pod evicted), or a single process inside the container can be killed while the container keeps running.

I'm not sure what is monitored by this dashboard. Does it capture both full container kills and in-container process kills?
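A rough sketch, with placeholder pod names, of how the two cases could be told apart (assuming kubectl access to the cluster):

# A full container kill is recorded in the container's last state (reason OOMKilled, exit code 137):
kubectl -n staging get pod <user-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# A pod eviction under node memory pressure shows up as an Evicted event:
kubectl -n staging get events --field-selector reason=Evicted

# An in-container process kill leaves the container running, so it only shows up in the node's kernel log.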

consideRatio commented 1 year ago

Here are notes extracted from https://github.com/2i2c-org/infrastructure/issues/2098#issuecomment-1412989796, from when I tried to investigate this but didn't complete the investigation. Note that Yuvi also provided comments below the extracted comment.

OOMKills

I've used the command cat /dev/zero | head -c 10G | tail to trigger out-of-memory situations: tail buffers everything it reads (the stream contains no newlines), so memory use grows until either the process inside the container is killed or the container itself is killed via a pod eviction.

I've failed to see the OOMKills dashboard populated with data. When the user pod gets evicted/OOMKilled, which I figure should register on the dashboard, the support-prometheus-node-exporter pod that should observe it gets evicted itself, because the node is low on memory. Doh!

staging       104s        Normal    Created                     pod/jupyter-erik-402i2c-2eorg                          Created container notebook
staging       104s        Normal    Started                     pod/jupyter-erik-402i2c-2eorg                          Started container notebook
staging       104s        Normal    Pulled                      pod/jupyter-erik-402i2c-2eorg                          Container image "quay.io/2i2c/2i2c-hubs-image:69b1f9dff7c7" already present on machine
support       5s          Warning   Failed                      pod/support-cryptnono-58v2k                            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
support       4s          Warning   BackOff                     pod/support-cryptnono-58v2k                            Back-off restarting failed container
support       3s          Warning   BackOff                     pod/support-prometheus-node-exporter-nxcqs             Back-off restarting failed container
support       2s          Warning   Evicted                     pod/support-prometheus-node-exporter-nxcqs             The node was low on resource: memory.
support       0s          Normal    SuccessfulDelete            daemonset/support-prometheus-node-exporter             Deleted pod: support-prometheus-node-exporter-nxcqs
support       0s          Warning   FailedDaemonPod             daemonset/support-prometheus-node-exporter             Found failed daemon pod support/support-prometheus-node-exporter-nxcqs on node ip-192-168-29-112.ca-central-1.compute.internal, will try to kill it
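For reference, an event stream like the one above can be tailed cluster-wide with something like:

kubectl get events --all-namespaces --watch

or listed after the fact, sorted by time, with kubectl get events -A --sort-by=.lastTimestamp.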

So I figure we need some memory requests for the node-exporter part of the support chart, and also for cryptnono.

  Namespace                   Name                                      CPU Requests  CPU Limits  Memory Requests   Memory Limits      Age
  ---------                   ----                                      ------------  ----------  ---------------   -------------      ---
  kube-system                 aws-node-d5z62                            25m (1%)      0 (0%)      0 (0%)            0 (0%)             14d
  kube-system                 ebs-csi-controller-58b9ff5786-gz62k       60m (3%)      600m (31%)  240Mi (3%)        1536Mi (21%)       14d
  kube-system                 ebs-csi-node-46jfr                        30m (1%)      300m (15%)  120Mi (1%)        768Mi (10%)        14d
  kube-system                 kube-proxy-slhsk                          100m (5%)     0 (0%)      0 (0%)            0 (0%)             14d
  staging                     jupyter-erik-402i2c-2eorg                 50m (2%)      0 (0%)      6979321856 (94%)  8589934592 (115%)  11m
  support                     support-cryptnono-58v2k                   0 (0%)        0 (0%)      0 (0%)            0 (0%)             29m
  support                     support-prometheus-node-exporter-d5cgq    0 (0%)        0 (0%)      0 (0%)            0 (0%)             9m34s
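This table presumably comes from kubectl describe node for the node seen in the events above, e.g.:

kubectl describe node ip-192-168-29-112.ca-central-1.compute.internal

Note that cryptnono and node-exporter have no memory requests at all, which typically makes them early candidates for eviction when the node comes under memory pressure.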

It seems that cryptnono requires little to no CPU, but around 60Mi of memory. I'll go ahead and trial cryptnono with requests of 60Mi and limits of 120Mi, and node-exporter with requests of 20Mi and limits of 40Mi.

support-cryptnono-bm9fj                             1m           60Mi            
support-cryptnono-lgc5g                             1m           60Mi            
support-cryptnono-pqt9h                             1m           61Mi            
support-cryptnono-rfpxl                             1m           60Mi            
support-cryptnono-rnlzc                             1m           60Mi                 
support-prometheus-node-exporter-fvvtr              2m           20Mi            
support-prometheus-node-exporter-j97mg              1m           12Mi            
support-prometheus-node-exporter-lz2fs              5m           18Mi            
support-prometheus-node-exporter-m8gl6              1m           11Mi            
support-prometheus-node-exporter-s4hdd              1m           20Mi
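These per-pod usage numbers presumably come from something like:

kubectl top pod --namespace support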

Having tested these limits, I observed issues: cryptnono struggled to start up, and prometheus-node-exporter crashed during memory pressure given these constraints. Now I'm also testing cpu requests of 10m and limits of 100m for both cryptnono and node-exporter, with increased memory for cryptnono (requests 64Mi, limits 256Mi) and node-exporter (requests 32Mi, limits 64Mi).

I ended up increasing node-exporter's memory further, to requests of 64Mi and limits of 64Mi.
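For quickly trialing values like these before encoding them in the support chart, something like kubectl set resources can be used. A sketch, with daemonset names assumed from the pod names above; note that anything set this way is overwritten the next time the chart is deployed:

kubectl -n support set resources daemonset/support-prometheus-node-exporter \
  --requests=cpu=10m,memory=64Mi --limits=cpu=100m,memory=64Mi

kubectl -n support set resources daemonset/support-cryptnono \
  --requests=cpu=10m,memory=64Mi --limits=cpu=100m,memory=256Mi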

Following this, I can run the following in my user pod:

jovyan@jupyter-erik-402i2c-2eorg:~$ cat /dev/zero | head -c 10G | tail
Killed

But this doesn't show up in the OOMKiller dashboard. I'm not confident about what should and shouldn't show up there. After all, the entire user container wasn't killed, just the process within the container, so there is no mention of an eviction either.
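One way to confirm that only the process was killed and not the container (a sketch): the container's restart count should be unchanged and its last state should not show an OOMKilled termination.

kubectl -n staging get pod jupyter-erik-402i2c-2eorg \
  -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState}'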

I now managed to get my user pod evicted as well, without the node-exporter crashing. But that didn't show up either.

staging       55s         Warning   Evicted                     pod/jupyter-erik-402i2c-2eorg                           The node was low on resource: memory.
staging       55s         Normal    Killing                     pod/jupyter-erik-402i2c-2eorg                           Stopping container notebook

I think perhaps the deployed node-exporter is decoupled from this dashboard, as well as from the node CPU % panels etc.?

grafana.ubc-eoas.2i2c.cloud metrics

(screenshot omitted)

2i2c.pilot.2i2c.cloud metrics

In this hub the node % panels work, and kubectl top node also works.

(screenshot omitted)

consideRatio commented 1 year ago

Resolved by fixing the selector of a k8s Service for prometheus-node-exporter, see https://github.com/2i2c-org/infrastructure/issues/2191#issuecomment-1434907786
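For future reference, a mismatch like that can be spotted by checking whether the Service actually selects the node-exporter pods, e.g. (resource names assumed from this thread; an empty Endpoints object means the selector matches nothing):

kubectl -n support get endpoints support-prometheus-node-exporter
kubectl -n support get service support-prometheus-node-exporter -o jsonpath='{.spec.selector}'
kubectl -n support get pods --show-labels | grep node-exporter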

yuvipanda commented 1 year ago

I very much appreciate your thoroughness erik!