FastGeert commented 6 years ago

Problem

When provisioning VMs to CPU nodes, we take into account the memory they will consume from the host system and distribute the VMs over the nodes accordingly.

We do not yet control how much memory is used by the other supporting processes (agent, alba, ...) running on the CPU nodes. Based on the size of the node, we do reserve memory that should not be used by VMs (see https://github.com/0-complexity/selfhealing/blob/master/specs/provisioning-limits.md), but nothing in the system enforces a limit on those processes.

Hence, if the memory allocation of the supporting processes goes out of bounds, we lose control and can no longer predict how the Linux OOM killer will behave; it may, for example, start killing VMs.

Solution

Implement a cgroup in which we run all supporting processes, and limit the total amount of memory they can use together.
Implement health checks that trigger an alarm when the memory usage of the cgroup exceeds 80% of its limit.
Revise the numbers in https://github.com/0-complexity/selfhealing/blob/master/specs/provisioning-limits.md, and take them into account in the VM provisioning logic. @delandtj one for you.
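
A minimal sketch of what the 80% health check could look like, assuming the supporting processes live in a cgroup v1 memory controller directory that exposes `memory.usage_in_bytes` and `memory.limit_in_bytes`. The function names and the way the alarm is raised are illustrative assumptions, not existing code from this project:

```python
from pathlib import Path
from typing import Tuple

# Threshold from the proposal above: alarm when usage goes over 80% of the limit.
ALARM_THRESHOLD = 0.80


def read_cgroup_v1_memory(cgroup_dir: Path) -> Tuple[int, int]:
    """Return (usage_bytes, limit_bytes) for a cgroup v1 memory cgroup directory.

    Assumes the cgroup v1 memory controller file names; cgroup v2 uses
    different files (memory.current, memory.max).
    """
    usage = int((cgroup_dir / "memory.usage_in_bytes").read_text().strip())
    limit = int((cgroup_dir / "memory.limit_in_bytes").read_text().strip())
    return usage, limit


def should_alarm(usage_bytes: int, limit_bytes: int,
                 threshold: float = ALARM_THRESHOLD) -> bool:
    """True when usage is strictly above the given fraction of the limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return usage_bytes / limit_bytes > threshold
```

A health check could call `read_cgroup_v1_memory` on the cgroup path (for example somewhere under `/sys/fs/cgroup/memory/`, depending on how the cgroup is named) and raise the alarm when `should_alarm` returns `True`.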

grimpy commented 6 years ago

Could we leverage systemd here, which has built-in support for this kind of thing?
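
As a rough illustration of the systemd route, the sketch below renders a slice unit that caps memory for everything placed in it. `MemoryAccounting=` and `MemoryLimit=` are systemd resource-control directives (newer systemd on cgroup v2 uses `MemoryMax=` instead); the helper name, the description text, and the idea of one shared slice for all supporting processes are assumptions made for the example:

```python
def render_slice_unit(description: str, memory_limit_bytes: int) -> str:
    """Render the text of a systemd .slice unit with a memory cap.

    Hypothetical helper: the caller decides the file name (e.g. a slice
    dedicated to the supporting processes) and where to install it.
    """
    if memory_limit_bytes <= 0:
        raise ValueError("memory_limit_bytes must be positive")
    return (
        "[Unit]\n"
        f"Description={description}\n"
        "\n"
        "[Slice]\n"
        "MemoryAccounting=yes\n"
        f"MemoryLimit={memory_limit_bytes}\n"
    )
```

Each supporting service would then join the slice through a `Slice=` setting in its own unit, so the limit applies to all of them together.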

FastGeert commented 6 years ago

I don't have a problem with that.

FastGeert commented 6 years ago

Moved to the small nodes setup because, with the hyperconverged model, we need even more control over this.

0-complexity / openvcloud

Better control for memory allocation on cpu nodes via cgroups #1069
