2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Verify new jupyterhub/grafana-dashboards feature: free space #2098

Closed. consideRatio closed this issue 1 year ago

consideRatio commented 1 year ago

Yuvi has done work recently to provide a dashboard of free space in the NFS storage we use for home folders and the shared folder.

Tracking free space in NFS

Because of changes I suggested in the PR adding this dashboard to jupyterhub/grafana-dashboards:

Action points:

Tracking out of memory issues

I'm not sure if this new grafana dashboard has been trialed in our hubs, so let's assume it hasn't been and verify it:

Action points:

GeorgianaElena commented 1 year ago

@consideRatio, I believe for the "Update grafana dashboard" action point, what needs to happen is to manually trigger the workflow in https://github.com/2i2c-org/infrastructure/blob/master/.github/workflows/deploy-grafana-dashboards.yaml and make sure all the clusters are listed there. That workflow should know to deploy the latest dashboards from jupyterhub/grafana-dashboards (a rough sketch of its shape is below).

Note: Any manually created dashboards (if any?) in the grafanas (only in the JupyterHub Default Dashboards folder, I believe) will get wiped.
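
For orientation, this is roughly the shape such a manually triggerable, per-cluster workflow takes in GitHub Actions terms. This is a hedged sketch only, not the contents of the real deploy-grafana-dashboards.yaml; that file and its cluster list are authoritative, and the cluster names here are placeholders.

```yaml
# Rough sketch of a manually triggerable, per-cluster deploy workflow.
# NOT the contents of the real .github/workflows/deploy-grafana-dashboards.yaml.
name: Deploy grafana dashboards

on:
  workflow_dispatch: # enables the manual "Run workflow" button in the Actions UI

jobs:
  deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # every cluster that runs a grafana needs an entry here,
        # otherwise its dashboards are never refreshed
        cluster_name: [2i2c, ubc-eoas]
    steps:
      - uses: actions/checkout@v4
      - name: Deploy latest dashboards from jupyterhub/grafana-dashboards
        # placeholder step; the real workflow runs the actual deploy tooling
        run: echo "would deploy dashboards to ${{ matrix.cluster_name }}"
```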

consideRatio commented 1 year ago

We would have wanted this resolved for https://2i2c.freshdesk.com/a/tickets/415!

consideRatio commented 1 year ago

OOMKills

I've used the command `cat /dev/zero | head -c 10G | tail` to either get the process inside the container killed, or the container itself killed via a pod eviction (tail buffers everything it reads in memory, so this allocates roughly 10GB).

I've failed to see the OOMKills dashboard populated with data. As part of the user pod getting evicted/OOMKilled (which I figure should register), I see that the support-prometheus-node-exporter pod that should observe it is itself getting evicted from the node, which is low on memory. Doh!

staging       104s        Normal    Created                     pod/jupyter-erik-402i2c-2eorg                          Created container notebook
staging       104s        Normal    Started                     pod/jupyter-erik-402i2c-2eorg                          Started container notebook
staging       104s        Normal    Pulled                      pod/jupyter-erik-402i2c-2eorg                          Container image "quay.io/2i2c/2i2c-hubs-image:69b1f9dff7c7" already present on machine
support       5s          Warning   Failed                      pod/support-cryptnono-58v2k                            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
support       4s          Warning   BackOff                     pod/support-cryptnono-58v2k                            Back-off restarting failed container
support       3s          Warning   BackOff                     pod/support-prometheus-node-exporter-nxcqs             Back-off restarting failed container
support       2s          Warning   Evicted                     pod/support-prometheus-node-exporter-nxcqs             The node was low on resource: memory.
support       0s          Normal    SuccessfulDelete            daemonset/support-prometheus-node-exporter             Deleted pod: support-prometheus-node-exporter-nxcqs
support       0s          Warning   FailedDaemonPod             daemonset/support-prometheus-node-exporter             Found failed daemon pod support/support-prometheus-node-exporter-nxcqs on node ip-192-168-29-112.ca-central-1.compute.internal, will try to kill it

So I figure we need some memory requests for the node-exporter part of the support chart, and also for cryptnono. Without any requests they end up in the BestEffort QoS class, which makes them the first candidates for eviction under node memory pressure, as the events above show.

  Namespace                   Name                                      CPU Requests  CPU Limits  Memory Requests   Memory Limits      Age
  ---------                   ----                                      ------------  ----------  ---------------   -------------      ---
  kube-system                 aws-node-d5z62                            25m (1%)      0 (0%)      0 (0%)            0 (0%)             14d
  kube-system                 ebs-csi-controller-58b9ff5786-gz62k       60m (3%)      600m (31%)  240Mi (3%)        1536Mi (21%)       14d
  kube-system                 ebs-csi-node-46jfr                        30m (1%)      300m (15%)  120Mi (1%)        768Mi (10%)        14d
  kube-system                 kube-proxy-slhsk                          100m (5%)     0 (0%)      0 (0%)            0 (0%)             14d
  staging                     jupyter-erik-402i2c-2eorg                 50m (2%)      0 (0%)      6979321856 (94%)  8589934592 (115%)  11m
  support                     support-cryptnono-58v2k                   0 (0%)        0 (0%)      0 (0%)            0 (0%)             29m
  support                     support-prometheus-node-exporter-d5cgq    0 (0%)        0 (0%)      0 (0%)            0 (0%)             9m34s

It seems that cryptnono requires little to no CPU, but about 60Mi of memory. I'll go ahead and trial giving cryptnono a request of 60Mi and a limit of 120Mi, and node-exporter a request of 20Mi and a limit of 40Mi.

support-cryptnono-bm9fj                             1m           60Mi            
support-cryptnono-lgc5g                             1m           60Mi            
support-cryptnono-pqt9h                             1m           61Mi            
support-cryptnono-rfpxl                             1m           60Mi            
support-cryptnono-rnlzc                             1m           60Mi                 
support-prometheus-node-exporter-fvvtr              2m           20Mi            
support-prometheus-node-exporter-j97mg              1m           12Mi            
support-prometheus-node-exporter-lz2fs              5m           18Mi            
support-prometheus-node-exporter-m8gl6              1m           11Mi            
support-prometheus-node-exporter-s4hdd              1m           20Mi

Having tested these limits, I observed issues: cryptnono struggled to start up, and prometheus-node-exporter crashed under memory pressure given the constraints. Now I'm also testing specifying CPU requests/limits of 10m/100m for both cryptnono and node-exporter, and increased memory for cryptnono to 64Mi/256Mi and node-exporter to 32Mi/64Mi.

I ended up increasing node-exporter further, to a memory request and limit of 64Mi each.
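
For reference, here is a minimal sketch of how those final numbers could be expressed as Helm values for the support chart. The key nesting is an assumption (for example, whether node-exporter is configured under a prometheus.prometheus-node-exporter key); the support chart's own values schema is authoritative.

```yaml
# Sketch of the requests/limits trialed above, expressed as Helm values.
# Key paths are assumptions; check the support chart before copying.
cryptnono:
  resources:
    requests:
      cpu: 10m
      memory: 64Mi
    limits:
      cpu: 100m
      memory: 256Mi

prometheus:
  prometheus-node-exporter:
    resources:
      requests:
        cpu: 10m
        memory: 64Mi
      limits:
        cpu: 100m
        memory: 64Mi
```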

Following this, I can in my user pod do:

jovyan@jupyter-erik-402i2c-2eorg:~$ cat /dev/zero | head -c 10G | tail
Killed

But this doesn't show up in the OOMKills dashboard. I'm not confident what should and shouldn't show up there. The entire user container wasn't killed after all, just the process within it, so there is no eviction to report either.

I managed to get my user pod evicted now as well, without the node-exporter crashing. But that didn't show up either.

staging       55s         Warning   Evicted                     pod/jupyter-erik-402i2c-2eorg                           The node was low on resource: memory.
staging       55s         Normal    Killing                     pod/jupyter-erik-402i2c-2eorg                           Stopping container notebook

I think perhaps the node-exporter in use is decoupled from this dashboard, as well as from the node CPU % panels etc.?

grafana.ubc-eoas.2i2c.cloud metrics

(dashboard screenshot)

2i2c.pilot.2i2c.cloud metrics

In this hub the node % panels work, and kubectl top node works as well.

(dashboard screenshot)

yuvipanda commented 1 year ago

That's some super cool testing, @consideRatio! I agree that setting requirements for these support components is a great idea.

I did test them, with some results in https://github.com/jupyterhub/grafana-dashboards/pull/52#issuecomment-1397603086. I tested by starting a user server and then basically having a user do something that caused their 'kernel to restart'. It registered as an OOM kill on the node. Process kills from hitting the container's configured memory limit were what I was primarily interested in, rather than nodes dying from memory pressure.

consideRatio commented 1 year ago

@yuvipanda where did you do the tests? I tested on staging.ubc-eoas.2i2c.cloud - AWS EKS. There I can't do kubectl top node, for example, so I was thinking maybe you did it on GCP, where such commands work and metrics are perhaps collected a bit differently?

My tests were done with a pod that consumed almost the full capacity of the node, which made k8s observe memory pressure and trigger pod evictions. I guess you ran tests on GCP where, for example, a user is perhaps limited to 1GB - less than the node's memory - and that your tests therefore didn't lead to memory pressure on the node itself.

Hmm... But I got a lot of Killed messages from within the container as well, similar to what you tested before. And those also weren't captured in the dashboard, so something is up, at least on the AWS EKS cluster I tested on.

yuvipanda commented 1 year ago

I guess you ran tests on GCP where, for example, a user is perhaps limited to 1GB - less than the node's memory - and that your tests therefore didn't lead to memory pressure on the node itself.

I tested on the openscapes hub, which is EKS. The user was testing code that was using a lot of memory, so I suspect it was trying to make one big malloc and immediately failing, rather than a number of smaller ones that succeed until memory is full?

I would suggest testing on something with an artificial container memory limit perhaps to see how it goes?
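
For anyone picking this up, a minimal sketch of such a test, assuming a throwaway pod with an artificial container memory limit is acceptable; the pod name, image, and numbers below are arbitrary.

```yaml
# Throwaway pod with an artificial container memory limit, so the kill comes
# from the container's cgroup limit (an OOMKill) rather than from kubelet
# node-pressure eviction. Name, image, and numbers are arbitrary.
apiVersion: v1
kind: Pod
metadata:
  name: oomkill-test
spec:
  restartPolicy: Never
  containers:
    - name: memory-hog
      image: ubuntu:22.04
      # tail buffers everything it reads, so this keeps allocating until the
      # 256Mi limit is hit and the kernel OOM-kills the process
      command: ["bash", "-c", "cat /dev/zero | head -c 1G | tail"]
      resources:
        requests:
          memory: 128Mi
        limits:
          memory: 256Mi
```

With a limit well below the node's capacity, this should exercise the container-level OOMKill path (the case the dashboard is primarily meant to count) without putting the node itself under memory pressure.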

consideRatio commented 1 year ago

I've extracted a dedicated issue in #2213 for OOMKiller, closing this!