Possible solutions (please edit/add more):
Haven't had time to look into this yet, but I think we should set requests and limits to the same amount. The docs say a pod won't be scheduled onto a node if the sum of requests would exceed the node's capacity (limits aren't considered). Setting requests equal to limits should at least guarantee no node failures due to OOM.
For example, we currently have the memory request set to 1G and the limit set to 8G. That means a node with 16GB of memory (ignoring the OS) can have up to 16 containers scheduled onto it before Kubernetes stops placing pods there, yet each of those containers is allowed to grow to 8G. If users then suddenly ramp up memory use in their pods, the node itself runs out of memory. I don't think any pod is given a higher OOM priority (oom_score_adj) than host processes, so the OOM killer is about as likely to kill host processes as ones inside a pod.
Yes, setting the requests and limits to the same amount should do much more to prevent failures like these. Also, TIL how the OOM killer picks its victims.
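To make the idea concrete, here is a minimal sketch of a container's resources block with the memory request and limit set equal (placeholder values, not our actual chart config):

```yaml
# Illustrative Kubernetes container spec fragment (placeholder values).
# The scheduler only counts requests when placing pods, so making the
# memory request equal to the limit means a node can never be committed
# to more memory than it actually has reserved.
resources:
  requests:
    cpu: "0.5"
    memory: 1Gi
  limits:
    cpu: "4"      # CPU can still burst; an overloaded CPU just slows things down
    memory: 1Gi   # equal to the request, so no node-level memory overcommit
```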
In our first failure (chick1), we definitely ran out of memory. Excerpt from chick1's `/var/log/kern.log`:
Nov 3 13:20:45 chick1 kernel: [5181979.056139] calico-node invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-998
...
Nov 3 13:20:46 chick1 kernel: [5181979.056645] Out of memory: Killed process 5412 (rsession) total-vm:8600120kB, anon-rss:7673408kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:15540kB oom_score_adj:936
I guess Calico got killed and the entire node's networking went down. Also, an rsession process was using 8GB of memory and got killed as well. (Update: I might be wrong; it looks like calico-node wasn't actually killed, it just invoked the OOM killer, which then killed rsession. I'm not entirely sure. Either way, having the OOM killer triggered at the node level rather than inside a cgroup is not ideal.)
In our second failure (chick6), there are some OOM logs, but those say `Memory cgroup out of memory: Killed process 21913 (R)` instead. R was still taking up a bunch of memory, but it was the cgroup that ran out of memory (i.e. the container itself hit the limit defined by k8s), not the host. Also, the timestamp was 7 hours before the actual failure. I'm still investigating why chick6 failed yesterday.
I see networking errors when the node failed: `enp1s0: Could not set DHCPv4 address: Connection timed out`, which is what brought the node down, but I can't find a root cause for it.
There is a message from SMART near the time of failure: `Nov 04 22:49:24 chick6 smartd[960]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 115`, but I highly doubt those units are right (we'd be boiling water at that point). Since all of our nodes have similar readings (all around 110~120), I'm going to assume this is actually Fahrenheit and not an issue.
Other than that, nothing in journalctl seems out of the ordinary, and dmesg shows nothing near the time of failure. I'm still going to poke around, but I don't know what the issue could be. Grafana also showed nothing unusual in memory, CPU, or load average before the failure.
Tentatively updating resource constraints to be as follows:

| | JupyterHub | BinderHub |
|---|---|---|
| Original CPU guarantee / limit | 0.5 / 4 | 0.5 / 4 |
| New CPU guarantee / limit | 0.5 / 4 | 0.5 / 4 |
| Original memory guarantee / limit | 1G / 8G | 1G / 8G |
| New memory guarantee / limit | 7G / 7G | 2G / 2G |
If a node's CPU gets overloaded, nothing too catastrophic happens, things just run slower, so I didn't change the CPU settings. See the earlier comment for why I think having the same memory guarantee and limit is important. I wanted to keep our original 8GB limit for user pods in JupyterHub, but since we have a bunch of nodes with 15GB of RAM, that would mean k8s schedules at most one pod on each of them and we waste a lot of resources; 7GB lets us squeeze two pods onto such a node, and I don't think that's much of a dealbreaker. For Binder, those pods come and go fairly quickly and usually only run short-lived code from a LibreTexts textbook, so they shouldn't be too resource intensive, and I reduced the memory guarantee/limit to make sure more of them can spawn. In my opinion we could limit Binder's CPU and memory even further (maybe guarantee 0.25 and limit 1 core, with 1GB of memory), but that's a different issue.
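For reference, a rough sketch of what those new values would look like in Zero to JupyterHub-style YAML (key names follow the standard chart layout; treat this as illustrative rather than our exact config files):

```yaml
# JupyterHub chart values: memory guarantee == limit, so the scheduler
# only places a user pod where its full 7G is actually reserved.
singleuser:
  cpu:
    guarantee: 0.5
    limit: 4
  memory:
    guarantee: 7G
    limit: 7G
---
# BinderHub chart values: user pods are configured via the embedded
# JupyterHub subchart (key path assumed).
jupyterhub:
  singleuser:
    cpu:
      guarantee: 0.5
      limit: 4
    memory:
      guarantee: 2G
      limit: 2G
```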
I think the R usage is from Lindsay's class, not Magali's. Maybe she's had them load very large datasets fully into memory or something?
The upgrade is done (memory resource limits and requests are now the same), so I'm tentatively closing this issue. I'm fairly confident we won't get another OOM failure unless the host machine does something stupid (like kubelet itself suddenly using 10GB of RAM). We can reopen this if another node fails.
Context
During Magali Billen's class yesterday, her students had trouble running code in class. Only 6 people were logged in and they were doing lightweight work (if-else statements), but they started having problems running their code, with blue boxes popping up asking whether they wanted to restart their kernels.
Chick1 was down (we couldn't SSH into it), most likely out of memory (`System OOM encountered, victim process: rsession, pid: 5412` in the kubectl command below).

Solution
`ipmitool power cycle -H <chick1 IPMI IP> -U <IPMI username>`
At the time, the pods were concentrated on a couple of nodes rather than spread out. We thought we fixed this in #110, but that no longer seems to be working. For example, the JupyterHub pods are spawning only on chicks 3, 4, and 6.
I think disabling the user scheduler might help, since the user scheduler seems to be meant for cloud clusters that autoscale (i.e. add/remove nodes as needed) and it deliberately packs user pods onto as few nodes as possible.
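If we try that, the change would presumably look something like this in a Zero to JupyterHub-style config (key path from the standard chart docs; worth double-checking against the chart version we run):

```yaml
# Sketch: disable the user scheduler so user pods fall back to the default
# kube-scheduler, which spreads pods across nodes instead of packing them
# onto as few nodes as possible (packing mainly helps autoscaling clusters
# scale down).
scheduling:
  userScheduler:
    enabled: false
```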
Logs