QEMU fails to start due to "exitting QMP loop, command cancelled: unknown"

fidencio commented 8 months ago

When running the Kata Containers test suite, we've noticed that every 3~5 tests we'll hit an issue with QEMU not starting, and the only error we get out of QEMU is:

exitting QMP loop, command cancelled: unknown

After a retry, or a few retries, things will get back to normal, but this is enough to block us to use Ubuntu 23.10, as the errors are coming and going, and the time it takes to "get back to a normal state" is not something we can easily pinpoint.

Here goes information about the stack used:

linux-image-6.5.0-1003-intel-opt: 6.5.0-1003.3
qemu-system-x86: 8.0.4+dfsg-1ubuntu3+tdx1.0
ovmf: 2023.05-2+tdx1.0

And this is the command line used by Kata Containers:

/usr/bin/qemu-system-x86_64 -name sandbox-c47f7c6b5299148d9ec81975c400657cca68b41868921673bc8a2554aa236fd5 -uuid 05a14848-ceae-41ca-bea0-97a19b1032b9 -machine q35,accel=kvm,kernel_irqchip=split,confidential-guest-support=tdx -cpu host,-vmx-rdseed-exit,pmu=off -qmp unix:fd=3,server=on,wait=off -m 2048M,slots=10,maxmem=129344M -device pci-bridge,bus=pcie.0,id=pci-bridge-0,chassis_nr=1,shpc=off,addr=2,io-reserve=4k,mem-reserve=1m,pref64-reserve=1m -device virtio-serial-pci,disable-modern=false,id=serial0 -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/c47f7c6b5299148d9ec81975c400657cca68b41868921673bc8a2554aa236fd5/console.sock,server=on,wait=off -device virtio-blk-pci,disable-modern=false,drive=image-cda848eb82d2b912,scsi=off,config-wce=off,share-rw=on,serial=image-cda848eb82d2b912 -drive id=image-cda848eb82d2b912,file=/opt/kata/share/kata-containers/kata-ubuntu-latest-tdx.image,aio=threads,format=raw,if=none,readonly=on -device virtio-scsi-pci,id=scsi0,disable-modern=false -object tdx-guest,sept-ve-disable=on,id=tdx -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0 -device vhost-vsock-pci,disable-modern=false,vhostfd=4,id=vsock-3053934737,guest-cid=3053934737 -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/c47f7c6b5299148d9ec81975c400657cca68b41868921673bc8a2554aa236fd5/shared,security_model=none,multidevs=remap -netdev tap,id=network-0,vhost=on,vhostfds=5,fds=6 -device driver=virtio-net-pci,netdev=network-0,mac=32:e0:e7:0c:ba:c1,disable-modern=false,mq=on,vectors=4 -rtc base=utc,driftfix=slew,clock=host -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic --no-reboot -object memory-backend-ram,id=dimm1,size=2048M,private=on -numa node,memdev=dimm1 -kernel /opt/kata/share/kata-containers/vmlinuz-6.7-121-confidential -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/vda1 rootflags=data=ordered,errors=remount-ro ro rootfstype=ext4 console=hvc0 console=hvc1 debug systemd.show_status=true systemd.log_level=debug panic=1 nr_cpus=1 selinux=0 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket scsi_mod.scan=none agent.log=debug agent.debug_console agent.debug_console_vport=1026 agent.log=debug initcall_debug -bios /usr/share/ovmf/OVMF.fd -pidfile /run/vc/vm/c47f7c6b5299148d9ec81975c400657cca68b41868921673bc8a2554aa236fd5/pid -smp 1,cores=1,threads=1,sockets=1,maxcpus=1

The things I'm mostly interested to know are:

Was this previously reported?
- If so, is there some known solution?
What kind of extra information is needed?

fanchen2 commented 8 months ago

It seems to be the issue with qmp, we didn't test TDVM with qmp, qmp is also not included in the example from this release https://raw.githubusercontent.com/canonical/tdx/main/guest-tools/run_td.sh

Is qmp necessary for you? How about remove qmp(-qmp unix:fd=3,server=on,wait=off in your command) and try again to check if it can be reproduced?

BTW, how do you stop TDVM(qemu process) after every test? poweroff VM or kill qemu? Does qemu processs stopped by true before you start a new one?

fidencio commented 8 months ago

Is qmp necessary for you? How about remove qmp(-qmp unix:fd=3,server=on,wait=off in your command) and try again to check if it can be reproduced?

Let me see how easily this can be done from my side.

BTW, how do you stop TDVM(qemu process) after every test? poweroff VM or kill qemu? Does qemu processs stopped by true before you start a new one?

We shutdown the VM and after a second or so we send a SIGKILL to the process in case it's still up and running.

fidencio commented 8 months ago

"Invalid read at addr 0xFED40000, size 1, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40030, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40014, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40000, size 1, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40030, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40014, size 4, region '(null)', reason: rejected"

Depending on the run, I'm also seeing that ^^^ on the logs.

fanchen2 commented 8 months ago

"Invalid read at addr 0xFED40000, size 1, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40030, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40014, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40000, size 1, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40030, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40014, size 4, region '(null)', reason: rejected"

Depending on the run, I'm also seeing that ^^^ on the logs.

This is known issue, the error is from TDVF(OVMF), just a warning due to some feature missing, will not fix now, will not cause TDX function issue.

fanchen2 commented 8 months ago

Is qmp necessary for you? How about remove qmp(-qmp unix:fd=3,server=on,wait=off in your command) and try again to check if it can be reproduced?

Let me see how easily this can be done from my side.

BTW, how do you stop TDVM(qemu process) after every test? poweroff VM or kill qemu? Does qemu processs stopped by true before you start a new one?

We shutdown the VM and after a second or so we send a SIGKILL to the process in case it's still up and running.

SIGKILL should work, the same as our test, we run more than 100 cases, boot TDVM on the same host more than 100 times, didn't find such issue. I will try qmp and check if we can reproduce this issue.

fidencio commented 8 months ago

Let me see how easily this can be done from my side.

Disabling qmp from our side is a no-go.

fanchen2 commented 8 months ago

I finished the test boot TDVM 10times with "-qmp unix:/tmp/tdx-qmp.sock,server=on,wait=off", no issue. I have no experience about Kata, so I can't test with kata -qmp unix:fd3.

fidencio commented 8 months ago

I think I got to the root cause of the issue.

Basically, Kata Containers expects QEMU to start without any error, which is not the case as reported on #21. As QEMU reports an error, Kata Containers ends up cancelling the VM creation and returning the error up in the chain.

All in all, we could try to come up with a mechanism to ignore that specific error, but it wouldn't fly upstream, thus the better way to solve this would be to have such issue fixed sooner than later, if possible, on the QEMU side.

fidencio commented 8 months ago

More tests have been done on my side, and basically the error reported on #21, qemu: KVM_TDX_INIT_MEM_REGION failed Resource temporarily unavailable, is what's causing this issue. It's not a red herring after all.

fidencio commented 7 months ago

Closing this one in favour of #21

canonical / tdx

QEMU fails to start due to "exitting QMP loop, command cancelled: unknown" #22