OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0

Incorrect NUMA Node and CPU Pinning During VM Migration #6772

Open feldsam opened 3 weeks ago

feldsam commented 3 weeks ago


Description

The current Huge Pages support, introduced by the enhancement "Support use of huge pages without CPU pinning #6185", selects a NUMA node for the VM based on the node's free resources. At deployment time this mechanism balances load across NUMA nodes well. During VM migration, however, the selection is not re-evaluated on the target host, which leads to the inconsistencies described below.
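For context, the selection amounts to picking the NUMA node with the most free hugepages that can still hold the VM. A minimal standalone sketch of that idea (the `NumaNode` struct and `pick_node` function are illustrative, not OpenNebula's actual code):

```cpp
#include <cstdint>
#include <vector>

// Illustrative model of a host NUMA node; not OpenNebula's internal class.
struct NumaNode
{
    int      node_id;
    uint64_t free_hugepages; // free pages of the VM's requested page size
};

// Pick the NUMA node with the most free hugepages that can still hold the
// VM's memory; returns -1 when no node fits, i.e. the allocation fails.
static int pick_node(const std::vector<NumaNode>& nodes, uint64_t pages_needed)
{
    int      best_id   = -1;
    uint64_t best_free = 0;

    for (const NumaNode& n : nodes)
    {
        if (n.free_hugepages >= pages_needed && n.free_hugepages > best_free)
        {
            best_id   = n.node_id;
            best_free = n.free_hugepages;
        }
    }

    return best_id;
}
```

Because the result depends on the current host's free pages, it has to be recomputed on the migration target; carrying over the node id chosen on the source host is exactly what produces the failures below.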

To Reproduce

  1. Configure a VM to use Huge Pages and deploy it on a host (see the template sketch after this list).
  2. Initiate a migration using the standard save/restore or live migration method.
  3. Observe that the VM keeps using the old NUMA node on the target host, even when the scheduler selects a different NUMA node based on the target host's free resources.
  4. If the old NUMA node on the target does not have enough free memory, the migration may fail.
  5. Deploy new VMs and note the inconsistencies caused by the incorrectly pinned VMs.
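
Step 1 can be reproduced with a template along these lines (values are illustrative; hugepages are requested via TOPOLOGY without a PIN_POLICY, so only the memory is NUMA-bound):

```
MEMORY = "4096"
CPU    = "1"
VCPU   = "2"

# 2 MB hugepages, no PIN_POLICY: memory is backed by hugepages on the
# selected NUMA node while the vCPUs stay unpinned
TOPOLOGY = [
  HUGEPAGE_SIZE = "2"
]
```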

Expected behavior

On migration, the NUMA node should be selected again according to the target host's free resources, and the VM's pinning and both hosts' NUMA usage counters should be updated to match.

feldsam commented 3 weeks ago

@paczerny Hi, could you help me with my commit? https://github.com/FELDSAM-INC/one/commit/47641337a6748e92d4cd774b88fba902f0d4efd0

It doesn't work yet. Only the path you implemented works: the classic migration using save/restore, where the pinned CPUs are properly cleaned up. I tried to implement the same for live migration and for migration through poweroff, but it doesn't work. What did I do wrong? Thanks!
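
Conceptually, all three migration paths need the same two NUMA steps no matter how the memory itself is moved. A sketch building on the `NumaNode`/`pick_node` definitions above (the `release_pages`/`reserve_pages` helpers and `migrate_numa` are hypothetical, not OpenNebula's real API):

```cpp
#include <stdexcept>

// Hypothetical helpers: adjust a node's free-hugepage counter on a host.
static void release_pages(std::vector<NumaNode>& nodes, int id, uint64_t pages)
{
    for (NumaNode& n : nodes)
    {
        if (n.node_id == id) n.free_hugepages += pages;
    }
}

static void reserve_pages(std::vector<NumaNode>& nodes, int id, uint64_t pages)
{
    for (NumaNode& n : nodes)
    {
        if (n.node_id == id) n.free_hugepages -= pages;
    }
}

// The NUMA step every migration flavor (save/restore, live, poweroff) would
// share: drop the source pinning, then re-run the selection against the
// *target* host's free pages instead of copying the source node id over.
static int migrate_numa(std::vector<NumaNode>& source, int source_node,
                        std::vector<NumaNode>& target, uint64_t pages)
{
    release_pages(source, source_node, pages);

    int target_node = pick_node(target, pages);

    if (target_node == -1)
    {
        throw std::runtime_error("no NUMA node with enough free hugepages");
    }

    reserve_pages(target, target_node, pages);

    return target_node;
}
```

Under this view the three paths would differ only in *when* the step runs relative to the memory transfer, not in what it does.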