OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0

use memory and cpuset cgroups as base for resource calculation #1354

Open OpenNebulaProject opened 6 years ago

OpenNebulaProject commented 6 years ago

Author Name: Anton Todorov (Anton Todorov) Original Redmine Issue: 5421, https://dev.opennebula.org/issues/5421 Original Date: 2017-10-02


Hi,

Recent Linux distributions use cgroups for memory and CPU separation. When a cgroup such as 'machine.slice' (or similar, configurable?) is in use, please use the memory size and the CPU count from the relevant cgroups as the base for the resource calculation.

Currently we have a script that queries the cgroups and calculates the values of the RESERVED_* variables for each host: https://github.com/OpenNebula/addon-storpool/blob/master/misc/reserved.sh
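
For reference, a minimal Ruby sketch of the kind of calculation the script does, assuming cgroup v1 mounted under /sys/fs/cgroup and a machine.slice cpuset/memory cgroup (paths, file names and unit conversions are illustrative, not the actual reserved.sh):

#!/usr/bin/env ruby
# Sketch only: derive RESERVED_CPU / RESERVED_MEM from the machine.slice
# cgroup instead of maintaining them by hand (cgroup v1 paths assumed).

CPUS  = '/sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus'
LIMIT = '/sys/fs/cgroup/memory/machine.slice/memory.limit_in_bytes'

# Count the threads in a cpuset list such as "0-10,14-29,34-39"
vm_threads = File.read(CPUS).strip.split(',').sum do |part|
  a, b = part.split('-').map(&:to_i)
  b ? b - a + 1 : 1
end

host_threads = File.read('/proc/cpuinfo').scan(/^processor/).size

host_mem_kb = File.read('/proc/meminfo')[/MemTotal:\s+(\d+)/, 1].to_i
vm_mem_kb   = [File.read(LIMIT).to_i / 1024, host_mem_kb].min  # "unlimited" -> whole host

# RESERVED_CPU is in percent of a CPU (100 = one thread), RESERVED_MEM in KB
puts "RESERVED_CPU=\"#{(host_threads - vm_threads) * 100}\""
puts "RESERVED_MEM=\"#{host_mem_kb - vm_mem_kb}\""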

Best Regards, Anton Todorov

atodorov-storpool commented 4 years ago

Hi @rsmontero ,

Knowing that you are focused on the release of 5.10, please treat this update as not urgent.

I am trying to figure out a resolution for this issue, but with the introduction of NUMA it is becoming complicated. The initial PoC draft has two (incomplete) files:

0reserved_resources.sh - this drop-in games the monitoring script so that OpenNebula accounts the resources of the machine.slice cgroup instead of the host resources. There is some more work needed to make it configurable and so on; I am adding it here just as a reference. The final code will definitely take another shape, most probably integrated into kvm.rb.
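
As a rough illustration of that idea (not the actual 0reserved_resources.sh), such a drop-in boils down to reporting the machine.slice memory cgroup to the monitor instead of the host totals; the TOTALMEMORY/USEDMEMORY names follow what kvm.rb reports, everything else here is assumed:

# Illustration only: report the machine.slice memory (in KB) as the host memory.
base  = '/sys/fs/cgroup/memory/machine.slice'
limit = File.read("#{base}/memory.limit_in_bytes").to_i / 1024
usage = File.read("#{base}/memory.usage_in_bytes").to_i / 1024

puts "TOTALMEMORY=#{limit}"
puts "USEDMEMORY=#{usage}"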

numa.rb - only the parser of the cgroups is implemented; the results are printed on stderr.
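
Roughly, the parser does something like the following sketch (machine.slice cgroup v1 paths assumed; it builds the hash shown in the outputs below, converting the memory values to KB):

# Sketch only: the kind of cgroup parsing numa.rb does - expand the cpuset
# lists and read the memory limits of the machine.slice cgroup.
def expand(list)                       # "0-2,4" -> [0, 1, 2, 4]
  list.split(',').flat_map do |part|
    a, b = part.split('-').map(&:to_i)
    b ? (a..b).to_a : [a]
  end
end

cpuset = '/sys/fs/cgroup/cpuset/machine.slice'
memory = '/sys/fs/cgroup/memory/machine.slice'

cpus  = File.read("#{cpuset}/cpuset.cpus").strip
mems  = File.read("#{cpuset}/cpuset.mems").strip
limit = File.read("#{memory}/memory.limit_in_bytes").strip
usage = File.read("#{memory}/memory.usage_in_bytes").strip

info = {
  'cpuset' => { 'cpus' => cpus, 'mems' => mems,
                :cpus => expand(cpus), :mems => expand(mems) },
  'memory' => { 'limit_in_bytes' => limit, 'usage_in_bytes' => usage,
                :limit => limit.to_i / 1024, :usage => usage.to_i / 1024 }
}

STDERR.puts info.inspect               # the results are printed on stderr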

The following is the numa.rb output from an Intel Xeon Silver 4114:

# ./numa.rb 
{"cpuset"=>{"cpus"=>"0-10,14-29,34-39", "mems"=>"0-1", :cpus=>[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 34, 35, 36, 37, 38, 39], :mems=>[0, 1]}, "memory"=>{"limit_in_bytes"=>"309283782656", "usage_in_bytes"=>"267066503168", :limit=>302034944, :usage=>260807132}}
HUGEPAGE = [ NODE_ID = "0", SIZE = "1048576", PAGES = "0", FREE = "0" ]
HUGEPAGE = [ NODE_ID = "0", SIZE = "2048", PAGES = "283", FREE = "0" ]
CORE = [ NODE_ID = "0", ID = "0", CPUS = "0,20" ]
CORE = [ NODE_ID = "0", ID = "1", CPUS = "1,21" ]
CORE = [ NODE_ID = "0", ID = "2", CPUS = "2,22" ]
CORE = [ NODE_ID = "0", ID = "3", CPUS = "3,23" ]
CORE = [ NODE_ID = "0", ID = "4", CPUS = "4,24" ]
CORE = [ NODE_ID = "0", ID = "8", CPUS = "5,25" ]
CORE = [ NODE_ID = "0", ID = "9", CPUS = "6,26" ]
CORE = [ NODE_ID = "0", ID = "10", CPUS = "7,27" ]
CORE = [ NODE_ID = "0", ID = "11", CPUS = "8,28" ]
CORE = [ NODE_ID = "0", ID = "12", CPUS = "9,29" ]
MEMORY_NODE = [ NODE_ID = "0", TOTAL = "166373912", FREE = "7297164", USED = "159076748", DISTANCE = "0 1" ]
HUGEPAGE = [ NODE_ID = "1", SIZE = "1048576", PAGES = "0", FREE = "0" ]
HUGEPAGE = [ NODE_ID = "1", SIZE = "2048", PAGES = "0", FREE = "0" ]
CORE = [ NODE_ID = "1", ID = "0", CPUS = "10,30" ]
CORE = [ NODE_ID = "1", ID = "1", CPUS = "11,31" ]
CORE = [ NODE_ID = "1", ID = "2", CPUS = "12,32" ]
CORE = [ NODE_ID = "1", ID = "3", CPUS = "13,33" ]
CORE = [ NODE_ID = "1", ID = "4", CPUS = "14,34" ]
CORE = [ NODE_ID = "1", ID = "8", CPUS = "15,35" ]
CORE = [ NODE_ID = "1", ID = "9", CPUS = "16,36" ]
CORE = [ NODE_ID = "1", ID = "10", CPUS = "17,37" ]
CORE = [ NODE_ID = "1", ID = "11", CPUS = "18,38" ]
CORE = [ NODE_ID = "1", ID = "12", CPUS = "19,39" ]
MEMORY_NODE = [ NODE_ID = "1", TOTAL = "167772160", FREE = "26467428", USED = "141304732", DISTANCE = "1 0" ]

And the output from an AMD EPYC 7251:

# ./numa.rb 
{"cpuset"=>{"cpus"=>"0-2,4-10,12-15", "mems"=>"1,3", :cpus=>[0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15], :mems=>[1, 3]}, "memory"=>{"limit_in_bytes"=>"27495759872", "usage_in_bytes"=>"0", :limit=>26851328, :usage=>0}}
CORE = [ NODE_ID = "0", ID = "0", CPUS = "0,8" ]
CORE = [ NODE_ID = "0", ID = "4", CPUS = "1,9" ]
MEMORY_NODE = [ NODE_ID = "0", TOTAL = "0", FREE = "0", USED = "0", DISTANCE = "0 3" ]
HUGEPAGE = [ NODE_ID = "1", SIZE = "1048576", PAGES = "0", FREE = "0" ]
HUGEPAGE = [ NODE_ID = "1", SIZE = "2048", PAGES = "32", FREE = "0" ]
CORE = [ NODE_ID = "1", ID = "8", CPUS = "2,10" ]
CORE = [ NODE_ID = "1", ID = "12", CPUS = "3,11" ]
MEMORY_NODE = [ NODE_ID = "1", TOTAL = "16688380", FREE = "11569980", USED = "5118400", DISTANCE = "1 3" ]
CORE = [ NODE_ID = "2", ID = "16", CPUS = "4,12" ]
CORE = [ NODE_ID = "2", ID = "20", CPUS = "5,13" ]
MEMORY_NODE = [ NODE_ID = "2", TOTAL = "0", FREE = "0", USED = "0", DISTANCE = "2 3" ]
HUGEPAGE = [ NODE_ID = "3", SIZE = "1048576", PAGES = "0", FREE = "0" ]
HUGEPAGE = [ NODE_ID = "3", SIZE = "2048", PAGES = "0", FREE = "0" ]
CORE = [ NODE_ID = "3", ID = "24", CPUS = "6,14" ]
CORE = [ NODE_ID = "3", ID = "28", CPUS = "7,15" ]
MEMORY_NODE = [ NODE_ID = "3", TOTAL = "16762880", FREE = "13730040", USED = "3032840", DISTANCE = "3 2" ]

I'd like to discuss how to address the following issues:

  1. cpuset.cpus are the threads where a VM could be started. There are a few options:
     1.1 extract the values from the COREs, but probably this will break things
     1.2 report a list of the excluded cores to ignore (CGROUP_IGNORED=....)
     1.3 report a list of the allowed cores (CGROUP_ALLOWED=...)

  2. cpuset.mems lists the NUMA nodes whose memory could be used.
     2.1 the memory of a NUMA node that is not listed should be excluded by the scheduler
     2.2 following (2.1), the total memory reported should be min(memory.limit_in_bytes, NUMA_memory); see the sketch after this list
   Would this change need to be addressed in both numa.rb and kvm.rb, or is there a precedence for which memory is accounted when there are values in MEMORY_NODE (via numa.rb) and TOTALMEMORY/USEDMEMORY (via kvm.rb)?
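
To make 2.2 concrete, here is a small sketch of the proposed accounting using the numbers from the AMD host above (the variable names are mine):

# Illustration only: the min(memory.limit_in_bytes, NUMA_memory) proposal,
# with the values from the AMD EPYC output (all sizes in KB).
allowed_mems  = [1, 3]                                   # from cpuset.mems
node_total_kb = { 0 => 0, 1 => 16_688_380, 2 => 0, 3 => 16_762_880 }
limit_kb      = 27_495_759_872 / 1024                    # memory.limit_in_bytes

numa_memory_kb = node_total_kb.values_at(*allowed_mems).sum
total_kb       = [limit_kb, numa_memory_kb].min

puts total_kb   # => 26851328, the cgroup limit is lower than the allowed NUMA memory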

Am I missing something?

And all of this should be integrated into the Core for processing, but I have no clue how to proceed. I'd guess that a change with minimal intervention is preferred, but any guidance/help is welcome :)

Best, Anton

rsmontero commented 4 years ago

Thanks @atodorov-storpool

The current approach of using cgroups is described in the KVM documentation here. The idea is to assign CPU shares based on the CPU attribute; this is how the KVM deployment file is generated.
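
(For context, a hypothetical illustration of that mapping, assuming the documented CPU * 1024 relation; this is not the actual driver code:)

# Illustration only: CPU attribute -> cgroup cpu shares in the libvirt file.
def cputune(cpu)
  shares = (cpu.to_f * 1024).round
  "<cputune><shares>#{shares}</shares></cputune>"
end

puts cputune(0.5)   # a VM with CPU = 0.5 gets half the shares of one with CPU = 1.0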

About NUMA, we assume that NUMA is mainly used to pin VMs to CPU cores/threads, so cgroups or other sharing mechanisms are not needed (as they are fundamentally at odds with pinning a VM). However, you can control some overcommitment in pinning as described here.

Finally, the RESERVED_* attributes serve a completely different purpose. Their goal is to reserve resources for the hypervisor itself, or to overcommit the overall capacity of the host.
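
(For example, reserving two threads and 4 GiB for the hypervisor would look like this in the host template, added with onehost update; the values are illustrative:)

RESERVED_CPU = "200"      # in percent of a CPU, 100 = one core/thread
RESERVED_MEM = "4194304"  # in KB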

So what are you missing in the current approach? (I would say that the functionality is already present in the system as described above)

atodorov-storpool commented 4 years ago

> The current approach of using cgroups is described in the KVM documentation here. The idea is to assign CPU shares based on the CPU attribute; this is how the KVM deployment file is generated. About NUMA, we assume that NUMA is mainly used to pin VMs to CPU cores/threads, so cgroups or other sharing mechanisms are not needed (as they are fundamentally at odds with pinning a VM). However, you can control some overcommitment in pinning as described here. Finally, the RESERVED_* attributes serve a completely different purpose. Their goal is to reserve resources for the hypervisor itself, or to overcommit the overall capacity of the host.

The implementation addresses only the CPU shares for a given task. The issue I am raising is the case where the CPU and NUMA partitioning is implemented with the cpuset cgroup and the memory partitioning with the memory cgroup.

The use case is to have a guaranteed, exclusive set of resources (CPUs and an amount of memory) dedicated to a particular set of tasks. It is common in hyper-converged setups, where different kinds of tasks execute on a host with a lot of resources.

Take, for example, the following tree:

 root
  |
  |-- machine.slice
  |-- system.slice
  |-- user.slice
  `-- highperformance.slice

In the system.slice are all the system services, and the administrator wants to allocate one core (two threads) and some amount of RAM to them, so that in case of a memory leak in a system service the OOM killer will not kill VMs in the machine slice. For the highperformance slice the administrator allocates a few cores and the memory on the NUMA node where the NIC and HBA controller interrupts are routed on the motherboard. The rest of the cores and memory are allocated for the VMs in the machine.slice.

Here is what OpenNebula knows about this setup per host:

  1. RESERVED_CPU - some of the cores are reserved for the system services. If the administrator wants to define true CPU over-subscription, they need to manually calculate how many cores are actually usable for VMs against the overcommitment factor (see the sketch after this list).
  2. RESERVED_MEM - some of the memory is reserved, so the scheduler will take this into account, but in case of a memory upgrade these values must be adjusted manually.
  3. NUMA - there is a CPU topology and memory per NUMA domain. The CPUs/threads could be excluded, again manually. But with memory it becomes complicated: for example, the case where the memory of an entire NUMA node is excluded via cpuset is not addressed.
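
To illustrate point 1, this is roughly the arithmetic the administrator has to do by hand today (the numbers and the interpretation of the overcommitment factor are my assumptions):

# Illustration only: manual RESERVED_CPU calculation for point 1.
host_threads = 40      # what the monitor reports (TOTAL_CPU / 100)
vm_threads   = 32      # threads allowed by the machine.slice cpuset
overcommit   = 2.0     # desired over-subscription for the VM threads

# The usable capacity should be vm_threads * overcommit, so the reservation
# has to absorb the difference (and may even become negative):
reserved_cpu = ((host_threads - vm_threads * overcommit) * 100).to_i
puts "RESERVED_CPU=\"#{reserved_cpu}\""   # => RESERVED_CPU="-2400"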

My point is that currently there are several sources of data that must be adjusted manually, which is error prone. Human errors can be made in several places - the cgroups configuration, the OpenNebula configuration, etc. All of the above manual operations could be handled by OpenNebula, and at least part of the human errors could be eliminated.

atodorov-storpool commented 4 years ago

Just to add that DPDK (which is supported by OpenNebula) also uses the cpuset cgroup and has NUMA-related memory constraints to provide optimal performance ;)