OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0

NUMA monitoring inconsistent after host hardware change #6631

Open OpenNebulaSupport opened 3 months ago

OpenNebulaSupport commented 3 months ago

Description

Running onehost show X after replacing the host hardware fails with the error "undefined method `[]' for nil:NilClass".

Presumably the new hardware has different NUMA capabilities but the same hostname and SSH configuration.

To Reproduce

Steps to reproduce the behavior:

1) Create a new dummy host: onehost create 127.0.0.1 --im kvm --vm kvm

2) Modify the host database body with: onedb update-body host --id <dummy-host-id>

Add the following body, making sure to adapt the ID to the ID of the dummy host:

New Body ```xml CHANGE ME! XXX 2 2 105 XXX 0 0 527460924 12800 511060804 12800 0 1 ```

3) Run onehost show <dummy-host-id>

The CLI output will be the following:

root@daniel-XPS-9320:/usr/lib/one/ruby/cli# onehost show 5
HOST 5 INFORMATION                                                              
ID                    : 5                   
NAME                  : XXX  
CLUSTER               : XXX     
STATE                 : MONITORED           
IM_MAD                : kvm                 
VM_MAD                : kvm                 
LAST MONITORING TIME  : 06/26 14:46:42      

HOST SHARES                                                                     
RUNNING VMS           : 0                   
MEMORY                                                                          
  TOTAL               : 31G                 
  TOTAL +/- RESERVED  : 15.3G               
  USED (REAL)         : 7.8G                
  USED (ALLOCATED)    : 0K                  
CPU                                                                             
  TOTAL               : 1600                
  TOTAL +/- RESERVED  : 1600                
  USED (REAL)         : 32                  
  USED (ALLOCATED)    : 0                   

MONITORING INFORMATION                                                          
ARCH="x86_64"
CGROUPS_VERSION="2"
CPUSPEED="0"
FENCE_IP="10.11.0.244"
HOSTNAME="daniel-XPS-9320"
HYPERVISOR="kvm"
IM_MAD="kvm"
KVM_CPU_FEATURES="vme,ds,acpi,ss,ht,tm,pbe,dtes64,monitor,ds_cpl,vmx,smx,est,tm2,xtpr,pdcm,osxsave,f16c,rdrand,arat,tsc_adjust,clflushopt,clwb,intel-pt,sha-ni,umip,pku,ospke,waitpkg,gfni,vaes,vpclmulqdq,rdpid,movdiri,movdir64b,pks,fsrm,md-clear,serialize,arch-lbr,stibp,arch-capabilities,core-capability,ssbd,avx-vnni,xsaveopt,xsavec,xgetbv1,xsaves,pdpe1gb,abm,invtsc,rdctl-no,ibrs-all,skip-l1dfl-vmentry,mds-no,pschange-mc-no"
KVM_CPU_MODEL="Broadwell-noTSX-IBRS"
KVM_CPU_MODELS="486 pentium pentium2 pentium3 pentiumpro coreduo n270 core2duo qemu32 kvm32 cpu64-rhel5 cpu64-rhel6 kvm64 Conroe Penryn Nehalem Nehalem-IBRS Westmere Westmere-IBRS SandyBridge SandyBridge-IBRS IvyBridge IvyBridge-IBRS SapphireRapids SapphireRapids-noTSX Opteron_G1"
KVM_MACHINES="pc-i440fx-jammy ubuntu pc-i440fx-impish-hpb pc-q35-5.2 pc-i440fx-2.12 pc-i440fx-2.0 pc-i440fx-xenial pc-i440fx-6.2 pc pc-q35-4.2 pc-i440fx-2.5 pc-i440fx-4.2 pc-i440fx-focal pc-i440fx-hirsute pc-q35-xenial pc-i440fx-jammy-hpb pc-i440fx-5.2 pc-i440fx-1.5 pc-q35-2.7 pc-q35-eoan-hpb pc-i440fx-zesty pc-i440fx-disco-hpb pc-q35-groovy pc-i440fx-groovy pc-q35-artful pc-i440fx-2.2 pc-i440fx-trusty pc-i440fx-eoan-hpb pc-q35-focal-hpb pc-q35-jammy-maxcpus pc-q35-bionic-hpb pc-i440fx-artful pc-i440fx-2.7 pc-q35-6.1 pc-i440fx-jammy-maxcpus pc-i440fx-yakkety pc-q35-2.4 pc-q35-cosmic-hpb pc-q35-2.10 x-remote pc-i440fx-1.7 pc-q35-5.1 pc-q35-2.9 pc-i440fx-2.11 pc-i440fx-jammy-hpb-maxcpus pc-q35-3.1 pc-i440fx-6.1 pc-q35-4.1 pc-q35-jammy ubuntu-q35 pc-i440fx-2.4 pc-i440fx-4.1 pc-q35-eoan pc-q35-jammy-hpb pc-i440fx-5.1 pc-i440fx-2.9 pc-i440fx-bionic-hpb isapc pc-i440fx-1.4 pc-q35-cosmic pc-q35-2.6 pc-i440fx-3.1 pc-q35-bionic pc-q35-disco-hpb pc-i440fx-cosmic pc-q35-2.12 pc-i440fx-bionic pc-q35-groovy-hpb pc-q35-disco pc-i440fx-cosmic-hpb pc-i440fx-2.1 pc-i440fx-wily pc-q35-impish pc-q35-6.0 pc-i440fx-impish pc-i440fx-2.6 pc-q35-impish-hpb pc-q35-hirsute pc-q35-4.0.1 pc-q35-hirsute-hpb pc-i440fx-1.6 pc-q35-5.0 pc-q35-2.8 pc-i440fx-2.10 pc-q35-3.0 pc-i440fx-6.0 pc-q35-zesty pc-q35-4.0 pc-q35-focal microvm pc-i440fx-2.3 pc-q35-jammy-hpb-maxcpus pc-i440fx-focal-hpb pc-i440fx-disco pc-i440fx-4.0 pc-i440fx-groovy-hpb pc-i440fx-hirsute-hpb pc-i440fx-5.0 pc-i440fx-2.8 pc-q35-6.2 q35 pc-i440fx-eoan pc-q35-2.5 pc-i440fx-3.0 pc-q35-yakkety pc-q35-2.11"
MODELNAME="13th Gen Intel(R) Core(TM) i7-1360P"
RESERVED_CPU=""
RESERVED_MEM="16400120"
TOTAL_ZOMBIES="1"
VERSION="6.8.0"
VM_MAD="kvm"
ZOMBIES="358"

undefined method `[]' for nil:NilClass

Expected behavior

The monitoring system should detect changes in NUMA capabilities and merge them correctly into the host information or, at the very least, log the mismatch.
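As a rough illustration of the "at least log it" fallback, the sketch below compares the NODE_ID values known for the host with those present in the monitoring data and prints a warning for any node without monitoring. It assumes both inputs are hashes (or lists of hashes) carrying a NODE_ID key. The names missing_numa_nodes and warn_numa_mismatch are hypothetical and not part of OpenNebula.

```ruby
# Hypothetical helper (not part of OpenNebula): lists the NODE_IDs present in
# the host NUMA topology but absent from the monitoring data.
def missing_numa_nodes(numa_nodes, monitoring)
    # Assumption: a single monitored node may arrive as a Hash, and nil means
    # that no monitoring data is available at all.
    monitoring = [monitoring] if monitoring.is_a?(Hash)
    monitored  = Array(monitoring).map {|m| m['NODE_ID'] }

    numa_nodes.map {|n| n['NODE_ID'] } - monitored
end

# Hypothetical wrapper: report the mismatch on stderr instead of raising.
def warn_numa_mismatch(numa_nodes, monitoring)
    missing = missing_numa_nodes(numa_nodes, monitoring)

    unless missing.empty?
        warn "NUMA nodes without monitoring data: #{missing.join(', ')}"
    end

    missing
end
```

Returning the list of missing IDs leaves it to the caller to decide whether to skip those nodes, show placeholders, or trigger a new host probe.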

Details

Additional context

The CLI error comes from the merge_numa_monitoring function in the onehost_helper.rb CLI helper:

    def merge_numa_monitoring(numa_nodes, monitoring)
        return if monitoring.nil?

        monitoring = [monitoring] if monitoring.class == Hash
        numa_nodes.each do |node|
            mon_node = monitoring.find {|x| x['NODE_ID'] == node['NODE_ID'] }

            node['MEMORY']['FREE'] = mon_node['MEMORY']['FREE']
            node['MEMORY']['USED'] = mon_node['MEMORY']['USED']

            node['HUGEPAGE'].each do |hp|
                mon_hp = mon_node['HUGEPAGE'].find {|x| x['SIZE'] == hp['SIZE'] }
                hp['FREE'] = mon_hp['FREE']

            end
        end
    end

With the reproduction above, the function parameters numa_nodes and monitoring have the following values.

numa_nodes:

{"CORE"=>{"CORES"=>"-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- ", "FREE"=>128, "USED"=>0}, "HUGEPAGE"=>[{"PAGES"=>"0", "SIZE"=>"1048576", "USAGE"=>"0"}, {"PAGES"=>"0", "SIZE"=>"2048", "USAGE"=>"0"}], "MEMORY"=>{"DISTANCE"=>"0", "TOTAL"=>"32458772", "USAGE"=>"0"}, "NODE_ID"=>"0"}
{"CORE"=>{"CORES"=>"-- -- -- -- -- -- -- -- -- -- ", "FREE"=>20, "USED"=>0}, "HUGEPAGE"=>[{"PAGES"=>"0", "SIZE"=>"1048576", "USAGE"=>"0"}, {"PAGES"=>"0", "SIZE"=>"2048", "USAGE"=>"0"}], "MEMORY"=>{"DISTANCE"=>"1 0", "TOTAL"=>"66051828", "USAGE"=>"0"}, "NODE_ID"=>"1"}

monitoring:

{"HUGEPAGE"=>[{"FREE"=>"0", "SIZE"=>"2048"}, {"FREE"=>"0", "SIZE"=>"1048576"}], "MEMORY"=>{"FREE"=>"17756328", "USED"=>"14702444"}, "NODE_ID"=>"0"}

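The monitoring parameter only contains data for NUMA node 0, while numa_nodes still lists nodes 0 and 1 because of the hardware modification. For node 1, monitoring.find returns nil, so mon_node['MEMORY'] raises the "undefined method `[]' for nil:NilClass" error shown above.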
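A minimal sketch of a nil-tolerant variant of the helper follows. It is not the upstream fix: the name merge_numa_monitoring_safe is hypothetical, and skipping nodes that have no monitoring entry is just one possible policy, assumed here for illustration.

```ruby
# Sketch only: a nil-tolerant variant of merge_numa_monitoring.
# The method name and the "skip unmonitored nodes" policy are assumptions.
def merge_numa_monitoring_safe(numa_nodes, monitoring)
    return if monitoring.nil?

    monitoring = [monitoring] if monitoring.is_a?(Hash)

    numa_nodes.each do |node|
        mon_node = monitoring.find {|x| x['NODE_ID'] == node['NODE_ID'] }

        # No monitoring entry for this node (e.g. after a hardware change):
        # leave FREE/USED unset instead of indexing into nil.
        next if mon_node.nil?

        node['MEMORY']['FREE'] = mon_node.dig('MEMORY', 'FREE')
        node['MEMORY']['USED'] = mon_node.dig('MEMORY', 'USED')

        # Normalize HUGEPAGE to an array whether it is nil, a Hash or an Array.
        mon_pages = [mon_node['HUGEPAGE']].flatten.compact

        [node['HUGEPAGE']].flatten.compact.each do |hp|
            mon_hp = mon_pages.find {|x| x['SIZE'] == hp['SIZE'] }
            hp['FREE'] = mon_hp['FREE'] unless mon_hp.nil?
        end
    end
end
```

With the values shown above, node 0 would receive its FREE and USED figures, while node 1 would be left without them rather than aborting the whole onehost show output.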


MiguelERuiz commented 1 month ago

I tested the indicated steps in a kvm-ssh microenv running version 6.10:

# step 1: host creation
onehost create 127.0.0.1 --im kvm --vm kvm
ID: 2
# step 2: host update
onedb update-body host --id 2
# step 3: host information
onehost show 2

For step 3, this is the output:

HOST 2 INFORMATION
ID                    : 2
NAME                  : XXX
CLUSTER               : XXX
STATE                 : MONITORED
IM_MAD                : kvm
VM_MAD                : kvm
LAST MONITORING TIME  : -

HOST SHARES
RUNNING VMS           : 0
MEMORY
  TOTAL               : 503G
  TOTAL +/- RESERVED  : 487.4G
  USED (REAL)         : 0K
  USED (ALLOCATED)    : 0K
CPU
  TOTAL               : 12800
  TOTAL +/- RESERVED  : 12800
  USED (REAL)         :
  USED (ALLOCATED)    : 0

MONITORING INFORMATION
ARCH="x86_64"
CGROUPS_VERSION="1"
CPUSPEED="0"
FENCE_IP="10.11.0.244"
HOSTNAME="eon17"
HYPERVISOR="kvm"
IM_MAD="kvm"
KVM_CPU_MODEL="EPYC-Milan"
KVM_CPU_MODELS="486 pentium pentium2 pentium3 pentiumpro qemu32 kvm32 cpu64-rhel5 cpu64-rhel6 qemu64 kvm64 Conroe Penryn Nehalem Nehalem-IBRS Westmere Westmere-IBRS SandyBridge SandyBridge-IBRS IvyBridge IvyBridge-IBRS Haswell-noTSX Haswell-noTSX-IBRS Broadwell-noTSX Broadwell-noTSX-IBRS Skylake-Client-noTSX-IBRS Icelake-Client Icelake-Client-noTSX Opteron_G1 Opteron_G2 Opteron_G3 EPYC EPYC-IBPB EPYC-Rome EPYC-Milan Dhyana"
KVM_MACHINES="pc-i440fx-rhel7.6.0 pc pc-q35-rhel8.6.0 pc-q35-rhel9.4.0 q35 pc-q35-rhel8.5.0 pc-q35-rhel8.3.0 pc-q35-rhel7.6.0 pc-q35-rhel8.4.0 pc-q35-rhel9.2.0 pc-q35-rhel8.2.0 pc-q35-rhel9.0.0 pc-q35-rhel8.0.0 pc-q35-rhel8.1.0"
MODELNAME="AMD EPYC 7713P 64-Core Processor"
RESERVED_CPU=""
RESERVED_MEM="16400120"
VERSION="6.4.4"
VM_MAD="kvm"

NUMA NODES

  ID CORES                                                                                                                                                                                            USED FREE
   0 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --  0    128
   1 -- -- -- -- -- -- -- -- -- --                                                                                                                                                                    0    20

NUMA MEMORY

 NODE_ID TOTAL    USED_REAL            USED_ALLOCATED       FREE
       0 503G     -                    0K                   -
       1 63G      -                    0K                   -

NUMA HUGEPAGES

 NODE_ID SIZE     TOTAL    FREE     USED
       0 1024M    0        -        0
       0 2M       0        -        0
       1 1024M    0        -        0
       1 2M       0        -        0

WILD VIRTUAL MACHINES

NAME                                                      IMPORT_ID  CPU     MEMORY

VIRTUAL MACHINES

  ID USER     GROUP    NAME                                                                                        STAT  CPU     MEM HOST                                                               TIME

As shown, the output includes the NUMA sections and no Ruby exception is raised.