@consideRatio, I believe for the "Update grafana dashboard" action point, what needs to happen is to manually trigger the workflow in
https://github.com/2i2c-org/infrastructure/blob/master/.github/workflows/deploy-grafana-dashboards.yaml
and make sure all the clusters are listed there. That workflow should know to deploy the latest dashboards in jupyterhub/grafana-dashboards.
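For reference, a manual dispatch like that can typically be done with the GitHub CLI; a minimal sketch, assuming the workflow declares a workflow_dispatch trigger:

```bash
# manually trigger the dashboard deploy workflow on the default branch
gh workflow run deploy-grafana-dashboards.yaml --repo 2i2c-org/infrastructure

# watch the resulting run
gh run list --workflow=deploy-grafana-dashboards.yaml --repo 2i2c-org/infrastructure
```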
Note: any manually created dashboards (if any?) in the Grafanas (in the JupyterHub Default Dashboards directory only, I believe) will get wiped.
In https://2i2c.freshdesk.com/a/tickets/415 we would have wanted this resolved!
I've used the command cat /dev/zero | head -c 10G | tail to get either the process in the container killed, or the container itself killed by a pod eviction.
I've failed to see the OOMKills dashboard populate with data. The user pod getting evicted/OOMKilled should, I figure, register there, but the support-prometheus-node-exporter pod that should observe it gets evicted itself by the node, which is low on memory. Doh!
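For context, an event stream like the one below can be followed with a plain kubectl watch while the memory-hogging command runs in the user pod:

```bash
# stream events across all namespaces while reproducing the OOM kill / eviction
kubectl get events --all-namespaces --watch
```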
```
staging 104s Normal Created pod/jupyter-erik-402i2c-2eorg Created container notebook
staging 104s Normal Started pod/jupyter-erik-402i2c-2eorg Started container notebook
staging 104s Normal Pulled pod/jupyter-erik-402i2c-2eorg Container image "quay.io/2i2c/2i2c-hubs-image:69b1f9dff7c7" already present on machine
support 5s Warning Failed pod/support-cryptnono-58v2k Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
support 4s Warning BackOff pod/support-cryptnono-58v2k Back-off restarting failed container
support 3s Warning BackOff pod/support-prometheus-node-exporter-nxcqs Back-off restarting failed container
support 2s Warning Evicted pod/support-prometheus-node-exporter-nxcqs The node was low on resource: memory.
support 0s Normal SuccessfulDelete daemonset/support-prometheus-node-exporter Deleted pod: support-prometheus-node-exporter-nxcqs
support 0s Warning FailedDaemonPod daemonset/support-prometheus-node-exporter Found failed daemon pod support/support-prometheus-node-exporter-nxcqs on node ip-192-168-29-112.ca-central-1.compute.internal, will try to kill it
```
So I figure we need some memory requests for the node-exporter part of the support chart, and also for cryptnono.
```
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system aws-node-d5z62 25m (1%) 0 (0%) 0 (0%) 0 (0%) 14d
kube-system ebs-csi-controller-58b9ff5786-gz62k 60m (3%) 600m (31%) 240Mi (3%) 1536Mi (21%) 14d
kube-system ebs-csi-node-46jfr 30m (1%) 300m (15%) 120Mi (1%) 768Mi (10%) 14d
kube-system kube-proxy-slhsk 100m (5%) 0 (0%) 0 (0%) 0 (0%) 14d
staging jupyter-erik-402i2c-2eorg 50m (2%) 0 (0%) 6979321856 (94%) 8589934592 (115%) 11m
support support-cryptnono-58v2k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29m
support support-prometheus-node-exporter-d5cgq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9m34s
```
It seems that cryptnono requires little to no CPU, but 60Mi of memory. I'll go ahead and trial having cryptnono request 60Mi of memory with a 120Mi limit, and node-exporter request 20Mi with a 40Mi limit.
```
support-cryptnono-bm9fj 1m 60Mi
support-cryptnono-lgc5g 1m 60Mi
support-cryptnono-pqt9h 1m 61Mi
support-cryptnono-rfpxl 1m 60Mi
support-cryptnono-rnlzc 1m 60Mi
support-prometheus-node-exporter-fvvtr 2m 20Mi
support-prometheus-node-exporter-j97mg 1m 12Mi
support-prometheus-node-exporter-lz2fs 5m 18Mi
support-prometheus-node-exporter-m8gl6 1m 11Mi
support-prometheus-node-exporter-s4hdd 1m 20Mi
```
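For anyone wanting to reproduce this quickly, one ad-hoc way to trial the values above without a chart change could be kubectl set resources against the two daemonsets (daemonset names inferred from the pod names above; a Helm deploy would overwrite this again, so the durable change belongs in the support chart's values):

```bash
# quick trial of the proposed values for cryptnono and prometheus-node-exporter
kubectl -n support set resources daemonset/support-cryptnono \
  --requests=memory=60Mi --limits=memory=120Mi
kubectl -n support set resources daemonset/support-prometheus-node-exporter \
  --requests=memory=20Mi --limits=memory=40Mi
```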
Having tested these limits, I observed issues. cryptnono struggled to start up, and prometheus-node-exporter crashed during memory pressure given the constraints. Now I'm testing to also specify CPU requests/limits of 10m / 100m for both cryptnono and node-exporter, and I've increased the memory for cryptnono to 64Mi requests / 256Mi limits, and for node-exporter to 32Mi requests / 64Mi limits.
I ended up increasing node-exporter's memory further, to 64Mi requests / 64Mi limits.
Following this, I can do the following in my user pod:
```
jovyan@jupyter-erik-402i2c-2eorg:~$ cat /dev/zero | head -c 10G | tail
Killed
```
But this doesn't show up in the OOMKiller dashboard. I'm not confident about what should and shouldn't show up there. The entire user container wasn't killed after all, just the process within the container, so there isn't a mention of an eviction either.
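One way to tell the two cases apart is to check whether the container itself was restarted with an OOMKilled reason; for example (pod name taken from the session above):

```bash
# empty output: only a process inside the container was killed
# "OOMKilled": the container itself was terminated and restarted
kubectl -n staging get pod jupyter-erik-402i2c-2eorg \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```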
I managed to get my user pod evicted now as well, without the node-exporter crashing. But that didn't show up either.
```
staging 55s Warning Evicted pod/jupyter-erik-402i2c-2eorg The node was low on resource: memory.
staging 55s Normal Killing pod/jupyter-erik-402i2c-2eorg Stopping container notebook
```
I think perhaps the node-exporter that's used is decoupled from this dashboard, as well as from the node CPU % etc.?
In this hub the node % works; kubectl top node also works here.
That's some super cool testing, @consideRatio! I agree that setting requirements for these support components is a great idea.
I did test them, with some results in https://github.com/jupyterhub/grafana-dashboards/pull/52#issuecomment-1397603086. I tested them by starting a user server and then basically having a user do something that caused their 'kernel to restart'; it registered as an OOM kill on the node. Process kills from hitting the container's configured memory limit were what I was primarily interested in, rather than nodes dying from memory pressure.
@yuvipanda where did you do the tests? I tested on staging.ubc-eoas.2i2c.cloud - AWS EKS. There I can't do kubectl top node for example, so I was thinking maybe you did it on GCP, where such commands work and metrics are perhaps collected a bit differently?
My tests were done with a pod that consumed almost the full capacity of the node, which made k8s observe memory pressure, which triggered pod evictions. I guess you ran tests on GCP where for example a user is limited to 1GB perhaps - less than the node's memory - so your tests therefore didn't lead to memory pressure on the node itself.
Hmm... But I got a lot of Killed from within the container as well, similar to what you tested before. And that also wasn't captured in the dashboard, so something is up, at least on the AWS EKS cluster I tested on.
> I guess you ran tests on GCP where for example a user is limited to 1GB perhaps - less than the node's memory - so your tests therefore didn't lead to memory pressure on the node itself.
I tested on the openscapes hub, which is EKS. The user was testing code that was using a lot of memory, so I suspect it was trying to make one big malloc and immediately failing, rather than a number of smaller ones that succeed until memory is full?
I would suggest testing on something with an artificial container memory limit perhaps to see how it goes?
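A minimal sketch of such a test, assuming a throwaway pod (the name, image and 256Mi limit here are arbitrary choices) so the container-level OOM kill happens without putting the node itself under memory pressure:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  restartPolicy: Never
  containers:
    - name: stress
      image: ubuntu:22.04
      command: ["bash", "-c", "cat /dev/zero | head -c 10G | tail"]
      resources:
        limits:
          memory: 256Mi
EOF

# once it has run, the container status should report an OOM kill
kubectl get pod oom-test \
  -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}{"\n"}'
```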
I've extracted a dedicated issue in #2213 for OOMKiller, closing this!
Yuvi has done work recently to provide a dashboard of free space in the NFS storage we use for home folders and the shared folder.
Tracking free space in NFS
Because of changes I suggested in the PR adding a dashboard to jupyterhub/grafana-dashboards:
Action points:
Tracking out of memory issues
I'm not sure if this new grafana dashboard has been trialed in our hubs, so let's assume it isn't and verify it:
Action points: