Closed: fidencio closed this issue 7 months ago.
It seems to be an issue with QMP; we didn't test TDVM with QMP, and QMP is also not included in the example script from this release: https://raw.githubusercontent.com/canonical/tdx/main/guest-tools/run_td.sh
Is QMP necessary for you? How about removing QMP (-qmp unix:fd=3,server=on,wait=off in your command) and trying again to check whether it can be reproduced?
BTW, how do you stop the TDVM (QEMU process) after every test? Do you power off the VM or kill QEMU? Has the QEMU process truly stopped before you start a new one?
> Is QMP necessary for you? How about removing QMP (-qmp unix:fd=3,server=on,wait=off in your command) and trying again to check whether it can be reproduced?
Let me see how easily this can be done from my side.
> BTW, how do you stop the TDVM (QEMU process) after every test? Do you power off the VM or kill QEMU? Has the QEMU process truly stopped before you start a new one?
We shut down the VM and, after a second or so, send a SIGKILL to the process in case it's still up and running.
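The shutdown-then-SIGKILL sequence described above can be sketched as follows. This is a minimal illustration using Python's subprocess module; the one-second grace period is taken from the comment, and the dummy `sleep` process stands in for QEMU — none of this is the actual Kata harness code.

```python
import signal
import subprocess

def stop_vm(proc: subprocess.Popen, grace: float = 1.0) -> int:
    """Ask the process to stop; escalate to SIGKILL if it's still running."""
    proc.terminate()  # graceful stop (stands in for powering off the guest)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGKILL)  # still up after the grace period
        return proc.wait()

# Demo with a long-running dummy process instead of a QEMU TDVM.
vm = subprocess.Popen(["sleep", "100"])
rc = stop_vm(vm)
```

The key point of the question above is the final `wait()`: without reaping the process and confirming it has exited, a lingering QEMU could still hold host resources when the next VM starts.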
"Invalid read at addr 0xFED40000, size 1, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40030, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40014, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40000, size 1, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40030, size 4, region '(null)', reason: rejected"
"Invalid read at addr 0xFED40014, size 4, region '(null)', reason: rejected"
Depending on the run, I'm also seeing that ^^^ in the logs.
This is a known issue; the error comes from TDVF (OVMF). It's just a warning due to a missing feature, it will not be fixed for now, and it does not cause any TDX functional issue.
> Is QMP necessary for you? How about removing QMP (-qmp unix:fd=3,server=on,wait=off in your command) and trying again to check whether it can be reproduced?
> Let me see how easily this can be done from my side.
> BTW, how do you stop the TDVM (QEMU process) after every test? Do you power off the VM or kill QEMU? Has the QEMU process truly stopped before you start a new one?
> We shut down the VM and, after a second or so, send a SIGKILL to the process in case it's still up and running.
SIGKILL should work; that's the same as in our tests. We ran more than 100 cases, booting a TDVM on the same host more than 100 times, and didn't hit such an issue. I will try QMP and check whether we can reproduce this issue.
> Let me see how easily this can be done from my side.
Disabling QMP on our side is a no-go.
I finished the test: I booted the TDVM 10 times with "-qmp unix:/tmp/tdx-qmp.sock,server=on,wait=off", and hit no issue. I have no experience with Kata, so I can't test with Kata's -qmp unix:fd=3.
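For context on what that QMP socket carries: the protocol is a small JSON exchange in which QEMU sends a greeting banner and the client must answer with a `qmp_capabilities` command before anything else. The sketch below exercises that handshake against a fake server on a socketpair, since it can't assume a running QEMU; the fake-server side is purely illustrative.

```python
import json
import socket
import threading

def qmp_handshake(sock):
    """QMP capabilities negotiation: read the greeting, send qmp_capabilities."""
    f = sock.makefile("rw", encoding="utf-8")
    greeting = json.loads(f.readline())  # QEMU greets with {"QMP": {...}}
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    reply = json.loads(f.readline())     # expect {"return": {}} on success
    return greeting, reply

def fake_qemu(sock):
    """Stand-in for QEMU's QMP server side, just enough for the handshake."""
    f = sock.makefile("rw", encoding="utf-8")
    f.write(json.dumps({"QMP": {"version": {}, "capabilities": []}}) + "\n")
    f.flush()
    cmd = json.loads(f.readline())       # the client's qmp_capabilities
    f.write(json.dumps({"return": {}}) + "\n")
    f.flush()

client, server = socket.socketpair()
t = threading.Thread(target=fake_qemu, args=(server,))
t.start()
greeting, reply = qmp_handshake(client)
t.join()
```

This is also why `wait=off` matters in the command line: without it, QEMU blocks at startup until a QMP client connects.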
I think I got to the root cause of the issue.
Basically, Kata Containers expects QEMU to start without any error, which is not the case here, as reported in #21. As QEMU reports an error, Kata Containers ends up cancelling the VM creation and returning the error up the chain.
All in all, we could try to come up with a mechanism to ignore that specific error, but it wouldn't fly upstream; the better way to solve this is to have the issue fixed, sooner rather than later, on the QEMU side, if possible.
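To make the "ignore that specific error" idea concrete: such a mechanism would boil down to matching QEMU's stderr against an allow-list of known-benign warnings, which is exactly the kind of fragile pattern list upstream tends to reject. The sketch below is hypothetical, not Kata code; the pattern covers the TDVF "Invalid read" warnings quoted earlier.

```python
import re

# Hypothetical allow-list of QEMU stderr lines known to be harmless, such as
# the TDVF/OVMF "Invalid read ... reason: rejected" warnings seen above.
BENIGN_PATTERNS = [
    re.compile(r"Invalid read at addr 0x[0-9A-Fa-f]+, size \d+, "
               r"region '\(null\)', reason: rejected"),
]

def is_fatal(stderr_line: str) -> bool:
    """Treat any stderr output as fatal unless it matches a benign pattern."""
    return not any(p.search(stderr_line) for p in BENIGN_PATTERNS)
```

The obvious downside, and the reason it wouldn't fly upstream, is that every new harmless warning QEMU emits would break VM creation until the list is extended.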
More tests have been done on my side, and basically the error reported in #21, qemu: KVM_TDX_INIT_MEM_REGION failed Resource temporarily unavailable, is what's causing this issue. It's not a red herring after all.
Closing this one in favour of #21
When running the Kata Containers test suite, we've noticed that every 3-5 tests we hit an issue with QEMU not starting, and the only error we get out of QEMU is:
After a retry, or a few retries, things get back to normal, but this is enough to block us from using Ubuntu 23.10, as the errors come and go, and the time it takes to get back to a normal state is not something we can easily pinpoint.
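The retry behaviour described above amounts to a simple retry loop around VM creation. A generic sketch follows; `flaky_start`, the retry budget, and the delay are placeholders for illustration, not the actual test harness.

```python
import time

def start_with_retries(start_vm, attempts=5, delay=2.0):
    """Call start_vm() until it succeeds or the retry budget is exhausted."""
    last_err = None
    for _ in range(attempts):
        try:
            return start_vm()
        except RuntimeError as err:  # stands in for the QEMU start failure
            last_err = err
            time.sleep(delay)
    raise last_err

# Demo: a start function that fails twice before succeeding.
calls = {"n": 0}
def flaky_start():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("qemu failed to start")
    return "vm-handle"

result = start_with_retries(flaky_start, attempts=5, delay=0.0)
```

The unpredictability mentioned above is what makes this unusable in practice: there is no bound on `attempts` that is guaranteed to get the host back to a working state.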
Here goes information about the stack used:
And this is the command line used by Kata Containers:
The things I'm most interested in knowing are: