nobuto-m opened 1 year ago
Marking this as something we may consider looking into later.
If you need guarantees around CPU and memory, you want to use CPU pinning and hugepages, as anything else can indeed be quite seriously overcommitted. We're not likely to find a way to prevent, or even calculate, an overcommit amount: while that's doable for VMs, it isn't for containers, and since a host can run both, the value would ultimately be useless.
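For reference, the pinning/hugepages setup suggested above might look roughly like this (the instance name and sizes are illustrative, not from the issue):

```shell
# Illustrative only: pin a VM to four specific host cores and back its
# memory with hugepages, so that neither resource can be overcommitted
# away by other guests on the same host.
lxc config set test-vm-1 limits.cpu=0-3              # pin to cores 0-3
lxc config set test-vm-1 limits.memory=8GiB
lxc config set test-vm-1 limits.memory.hugepages=true
```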
That said, there are some obvious things that we likely should check prior to instance startup and block starting instances that we know will immediately cause an OOM situation or the like.
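A pre-start check of the kind described above could, at its simplest, compare the requested memory limit against the host's current `MemAvailable`. This is only a sketch of the idea, not LXD code; a real check would also have to account for memory already reserved by other instances:

```shell
# can_fit REQ_BYTES: succeed if the host currently has at least that
# much available memory. Coarse check only: it ignores reservations
# made by instances that are defined but not yet consuming memory.
can_fit() {
    req_bytes=$1
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    [ "$req_bytes" -le $((avail_kb * 1024)) ]
}

# e.g. a 64GiB VM on a 32GiB host would be refused before QEMU starts
if can_fit $((64 * 1024 * 1024 * 1024)); then
    echo "launch allowed"
else
    echo "refusing launch: limits.memory exceeds host MemAvailable"
fi
```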
Similar to https://github.com/canonical/lxd/issues/8682
Required information
Kernel version: 5.15.0-67-generic #74-Ubuntu
Issue description
As far as I can see, LXD has no setting for a per-host overcommit ratio for CPU/memory/disk resources.
For example, it appears that one can create a VM exceeding the host's resources as long as the request stays within the project quota.
Let's say:
Project A: limits.cpu=200 limits.memory=800GiB
Project B: limits.cpu=10 limits.memory=200GiB
If Project A has already consumed almost all of the cluster's resources, it looks like Project B can still create a 200GiB VM on, say, a 128GiB host, and the OOM killer may then kill a Project A VM or a host process.
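The project quotas in the example above would be configured along these lines (the project names are placeholders matching the example):

```shell
# Illustrative: per-project quotas as in the example above. Note that
# each project is capped in isolation; nothing ties the sum of the
# quotas to the actual capacity of the hosts in the cluster.
lxc project create project-a
lxc project set project-a limits.cpu=200 limits.memory=800GiB
lxc project create project-b
lxc project set project-b limits.cpu=10 limits.memory=200GiB
```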
Steps to reproduce
As a minimal reproducer, on a 32GiB system for example:

lxc launch ubuntu:jammy test-vm-1 --vm -c limits.memory=64GiB
lxc exec test-vm-1 -- free -h          # reports 64GiB inside the guest
lxc exec test-vm-1 -- stress-ng -m 1 --vm-bytes 50G --timeout 60
Then OOM will be observed on the host:
Apr 06 14:42:55 t14 kernel: Out of memory: Killed process 220836 (qemu-system-x86) total-vm:69043096kB, anon-rss:51520kB, file-rss:0kB, shmem-rss:22018520kB, UID:999 pgtables:44320kB oom_score_adj:0