Possible solutions (please edit/add more):
Haven't had time to look into this yet, but I think we should set requests and limits to the same amount. The docs say a pod won't be scheduled onto a node if the sum of requests would exceed the node's capacity (limits aren't considered). Setting requests equal to limits should at least guarantee no node failures due to OOM.
For example, we currently have the memory request set to 1G and the limit set to 8G. That means a node with 16GB of memory (ignoring the OS) can have up to 16 containers scheduled onto it before Kubernetes stops placing pods there, yet each of those containers is allowed to grow to 8G. If users then suddenly ramp up memory use in their pods, the node itself runs out of memory. I don't think any pod is given a higher OOM priority (oom_score_adj) than host processes, so the OOM killer is about as likely to kill host processes as ones inside a pod.
Yes, setting the requests and limits to the same amount should do much more to prevent failures like these. Also, TIL how the OOM killer picks its victims.
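To make the idea concrete, here is a minimal sketch of a container's resources block with the memory request and limit set equal (placeholder values, not our actual chart config):

```yaml
# Illustrative Kubernetes container spec fragment (placeholder values).
# The scheduler only counts requests when placing pods, so making the
# memory request equal to the limit means a node can never be committed
# to more memory than it actually has reserved.
resources:
  requests:
    cpu: "0.5"
    memory: 1Gi
  limits:
    cpu: "4"      # CPU can still burst; an overloaded CPU just slows things down
    memory: 1Gi   # equal to the request, so no node-level memory overcommit
```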
In our first failure (chick1), we definitely ran out of memory. Excerpt from chick1's `/var/log/kern.log`:
Nov 3 13:20:45 chick1 kernel: [5181979.056139] calico-node invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-998
...
Nov 3 13:20:46 chick1 kernel: [5181979.056645] Out of memory: Killed process 5412 (rsession) total-vm:8600120kB, anon-rss:7673408kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:15540kB oom_score_adj:936
I guess Calico got killed and the entire node's networking went down. Also, an rsession process was using 8GB of memory and got killed as well. (Update: I might be wrong; it looks like calico-node wasn't actually killed, it just invoked the OOM killer, which then killed rsession. I'm not entirely sure. Either way, having the OOM killer triggered at the node level rather than inside a cgroup is not ideal.)
In our second failure (chick6), there are some OOM logs, but those say `Memory cgroup out of memory: Killed process 21913 (R)` instead. R was still taking up a bunch of memory, but it was the cgroup that ran out of memory (i.e. the container itself hit the limit defined by k8s), not the host. Also, the timestamp was 7 hours before the actual failure. I'm still investigating why chick6 failed yesterday.
I see networking errors when the node failed: `enp1s0: Could not set DHCPv4 address: Connection timed out`, which is what brought the node down, but I can't find a root cause for it.
There is a message from SMART near the time of failure: `Nov 04 22:49:24 chick6 smartd[960]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 115`, but I highly doubt those units are right (we'd be boiling water at that point). Since all of our nodes have similar readings (all around 110~120), I'm going to assume this is actually Fahrenheit and not an issue.
Other than that, nothing in journalctl seems out of the ordinary, and dmesg shows nothing near the time of failure. I'm still going to poke around, but I don't know what the issue could be. Grafana also showed nothing unusual in memory, CPU, or load average before the failure.
Tentatively updating resource constraints to be as follows:

| | JupyterHub | BinderHub |
|---|---|---|
| Original CPU guarantee / limit | 0.5 / 4 | 0.5 / 4 |
| New CPU guarantee / limit | 0.5 / 4 | 0.5 / 4 |
| Original memory guarantee / limit | 1G / 8G | 1G / 8G |
| New memory guarantee / limit | 7G / 7G | 2G / 2G |
If a node's CPU gets overloaded, nothing too catastrophic happens, things just run slower, so I didn't change the CPU settings. See the earlier comment for why I think having the same memory guarantee and limit is important. I wanted to keep our original 8GB limit for user pods in JupyterHub, but since we have a bunch of nodes with 15GB of RAM, that would mean k8s schedules at most one pod on each of them and we waste a lot of resources; 7GB lets us squeeze two pods onto such a node, and I don't think that's much of a dealbreaker. For Binder, those pods come and go fairly quickly and usually only run short-lived code from a LibreTexts textbook, so they shouldn't be too resource intensive, and I reduced the memory guarantee/limit to make sure more of them can spawn. In my opinion we could limit Binder's CPU and memory even further (maybe guarantee 0.25 and limit 1 core, with 1GB of memory), but that's a different issue.
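For reference, a rough sketch of what those new values would look like in Zero to JupyterHub-style YAML (key names follow the standard chart layout; treat this as illustrative rather than our exact config files):

```yaml
# JupyterHub chart values: memory guarantee == limit, so the scheduler
# only places a user pod where its full 7G is actually reserved.
singleuser:
  cpu:
    guarantee: 0.5
    limit: 4
  memory:
    guarantee: 7G
    limit: 7G
---
# BinderHub chart values: user pods are configured via the embedded
# JupyterHub subchart (key path assumed).
jupyterhub:
  singleuser:
    cpu:
      guarantee: 0.5
      limit: 4
    memory:
      guarantee: 2G
      limit: 2G
```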
I think the R usage is from Lindsay's class, not Magali's. Maybe she's had them load very large datasets fully into memory or something?
The upgrade is done (memory resource limits and requests are now the same), so I'm tentatively closing this issue. I'm fairly confident we won't get another OOM failure unless the host machine does something stupid (like kubelet itself suddenly using 10GB of RAM). We can reopen this if another node fails.
Context
During Magali Billen's class yesterday, her students had trouble running code in class. Only 6 people were logged in and they were doing lightweight work (if-else statements), but they started having problems running their code, with blue boxes popping up asking whether they wanted to restart their kernels.
Chick1 was down (we couldn't SSH into it), most likely out of memory (`System OOM encountered, victim process: rsession, pid: 5412` in the kubectl command below).

Solution
`ipmitool power cycle -H <chick1 IPMI IP> -U <IPMI username>`
At the time, the pods were concentrated on a couple of nodes rather than spread out. We thought we fixed this in #110, but that no longer seems to be working. For example, the JupyterHub pods are spawning only on chicks 3, 4, and 6.
I think disabling the user scheduler might help, since the user scheduler seems to be meant for cloud clusters that autoscale (i.e. add/remove nodes as needed) and it deliberately packs user pods onto as few nodes as possible.
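If we try that, the change would presumably look something like this in a Zero to JupyterHub-style config (key path from the standard chart docs; worth double-checking against the chart version we run):

```yaml
# Sketch: disable the user scheduler so user pods fall back to the default
# kube-scheduler, which spreads pods across nodes instead of packing them
# onto as few nodes as possible (packing mainly helps autoscaling clusters
# scale down).
scheduling:
  userScheduler:
    enabled: false
```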
Logs