OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0

ONE incorrectly handles "failed" live migration #6634

Closed: hydro-b closed this issue 1 week ago

hydro-b commented 3 months ago

Description: A VM live migration succeeds, but ONE receives exit code 1 from the migrate script and assumes the migration has not succeeded, while in practice the VM has been live migrated successfully (only the SYNC_TIME part failed).

To Reproduce

Unsure what conditions lead to this bug; we only see this behavior on one cluster. The error message is:

    Jun 26 10:43:38 oned3 oned[481791]: [VM 0][Z0][VMM][I]: Failed to execute virtualization driver operation: migrate.
    Jun 26 10:43:38 oned3 oned[481791]: [Z0][VMM][D]: Message received: MIGRATE FAILURE 0 virsh --connect qemu:///system migrate --live  DOMAIN-ID qemu+ssh://some_host/system (9.769216494s) Error mirgating VM DOMAIN-ID to host some_host: undefined method `upcase' for nil:NilClass ["/var/tmp/one/vmm/kvm/migrate:234:in `<main>'"] ExitCode: 1
    Jun 26 10:43:38 oned3 oned[481791]: [VM 0][Z0][VMM][E]: MIGRATE: virsh --connect qemu:///system migrate --live  DOMAIN-ID qemu+ssh://some_host/system (9.769216494s) Error mirgating VM DOMAIN-ID to host some_host: undefined method `upcase' for nil:NilClass ["/var/tmp/one/vmm/kvm/migrate:234:in `<main>'"] ExitCode: 1

The piece of code that fails:

/var/lib/one/remotes/vmm/kvm/migrate

    # Sync guest time
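    # NOTE: the next line is migrate:234 from the log above. When SYNC_TIME
    # is not set, ENV['SYNC_TIME'] is nil, and calling #upcase on it raises
    # the NoMethodError seen in the MIGRATE FAILURE message.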
    if ENV['SYNC_TIME'].upcase == 'YES'
        cmds =<<~EOS
            (
              for I in $(seq 4 -1 1); do
                if #{virsh} --readonly dominfo #{@deploy_id}; then
                  #{virsh} domtime --sync #{@deploy_id} && exit
                  [ "\$I" -gt 1 ] && sleep 5
                else
                  exit
                fi
              done
            ) &>/dev/null &
        EOS

It turns out that the virsh command does not return a list of domains; the result is therefore empty, and the error is not handled properly. So ideally the piece of code above gets fixed so that it checks whether the value is present, and if it is empty, returns a clear error message, e.g. "unable to obtain domain $domain with the virsh command".
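A minimal sketch of such a guard, assuming the intent is to skip the time sync rather than abort the whole migration when SYNC_TIME is unset; it reuses the script's virsh and @deploy_id values, and the error wording is illustrative, not the upstream fix:

    # Sketch of a defensive guard: normalize SYNC_TIME before comparing,
    # so an unset variable behaves like "no" instead of crashing on nil.
    sync_time = ENV['SYNC_TIME'].to_s.upcase == 'YES'

    if sync_time
        out = `#{virsh} --readonly dominfo #{@deploy_id} 2>/dev/null`
        if out.empty?
            STDERR.puts "unable to obtain domain #{@deploy_id} with virsh"
        else
            # ... run the existing domtime --sync loop as before ...
        end
    end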

And ideally ONE should not assume that the VM is still running, as it currently does:

    Jun 27 15:55:18 oned1 oned[1434842]: [VM 0][Z0][LCM][I]: Fail to live migrate VM. Assuming that the VM is still RUNNING.

It will then detect that the VM is in poweroff state, and the hypervisor the VM actually runs on detects a zombie. Ideally this is handled better: perform some extra checks to see where the VM is running, as sketched below.
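A rough sketch of such a check, following the suggestion above rather than actual driver code; the URIs mirror the migrate command in the log, domain_running? is a hypothetical helper, and a real fix would act on the outcome instead of just branching:

    # Hypothetical double check after a failed migrate: ask libvirt on both
    # the source and the destination where the domain actually lives,
    # instead of assuming it stayed on the source.
    require 'open3'

    def domain_running?(uri, deploy_id)
        out, status = Open3.capture2e('virsh', '--connect', uri,
                                      '--readonly', 'dominfo', deploy_id)
        status.success? && out.include?('running')
    end

    deploy_id = 'DOMAIN-ID' # placeholder, as in the log above
    on_src = domain_running?('qemu:///system', deploy_id)
    on_dst = domain_running?('qemu+ssh://some_host/system', deploy_id)

    if on_dst && !on_src
        # the migration actually succeeded: keep the VM RUNNING on the target
    elsif !on_dst && !on_src
        # the domain is gone everywhere: report POWEROFF instead of RUNNING
    end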

I can reproduce the behavior of virsh not listing domains by executing the ssh command without LIBVIRT_URI exported. This sounds like it could be related to the LIBVIRT_URI bug.

Strangely enough, I do not hit this issue when I explicitly export SYNC_TIME=yes. In fact it does not matter what value SYNC_TIME is set to: no, yes, or even broken will not trigger this issue. So maybe, if this is indeed the issue, kvmrc ENV vars are only read when SYNC_TIME is exported (and otherwise not?).
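That would match plain Ruby semantics: the crash depends only on whether the variable exists at all, not on its value (the exact NoMethodError wording varies with the Ruby version):

    # With SYNC_TIME unset, the ENV lookup returns nil and #upcase raises
    # the same NoMethodError that appears in the MIGRATE FAILURE log line.
    ENV.delete('SYNC_TIME')
    begin
        ENV['SYNC_TIME'].upcase
    rescue NoMethodError => e
        puts e.message # undefined method `upcase' for nil:NilClass
    end

    # With *any* exported value, even an invalid one, the check is harmless:
    ENV['SYNC_TIME'] = 'broken'
    puts ENV['SYNC_TIME'].upcase # => BROKEN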

Expected behavior: I expect ONE to handle the error gracefully instead of hitting an assert in the code, and to double-check where a live-migrated VM ended up running when a non-zero exit code is returned (instead of assuming the VM keeps running on the source hypervisor).

Details

Additional context: There seems to be a specific pre-condition that must hold to hit this bug, as we only see it happen on one specific (dedicated) cloud, but as of now it is unclear what that is. I can reproduce this issue, so if further debug information needs to be gathered, please let me know.

Ubuntu 22.04 (ONE front-end and hypervisors)


rsmontero commented 2 months ago

Regarding the state update: the monitor process will update the VM state to poweroff if it is not running (libvirt should keep it running). In the same way, if the migration fails, the VM will be running on the source host.

hydro-b commented 2 months ago

> Regarding the state update: the monitor process will update the VM state to poweroff if it is not running (libvirt should keep it running). In the same way, if the migration fails, the VM will be running on the source host.

Yes, I understand how this is currently handled, and on the destination host a "zombie" VM will be detected. This still leaves a time window for error (as long as storage fencing is not implemented). I think this behavior could be improved to cover failure scenarios like this one: ONE could perform some extra checking on the destination host to verify whether the VM ended up there, and raise an error with a helpful message otherwise (along the lines of the sketch earlier in this issue). In this particular case it would not have helped, though, as virsh was not able to detect any domains running.

For my understanding: this issue has been closed. Was this already fixed in another PR for the 6.10 release? To be clear, the main issues here are: 1) hitting the live-migration issue (ONE unable to detect running domains), and 2) graceful handling of this error by the code (in case no domains are detected).

rsmontero commented 2 months ago

Yes, it is fixed in 6.10, but it actually only solves the migration failure caused by the SYNC_TIME part.

So maybe we can keep this open for future improvements, as we did not make any change to the current behavior.