cockpit-project / cockpit-machines

Cockpit UI for virtual machines
GNU Lesser General Public License v2.1
276 stars 72 forks source link

Fix race condition in testCreateDownloadAnOS() on tumbleweed, disable broken SELinux on rawhide #1724

Closed martinpitt closed 1 month ago

martinpitt commented 1 month ago

This test almost always fails on tumbleweed, but never ever locally. I even tried to reproduce the exact series of nondestructive tests:

TEST_OS=opensuse-tumbleweed test/common/run-tests -tv TestMachinesStoragePools.testStoragePoolsDeletion TestMachinesSettings.testBootOrder TestMachinesNetworks.testNetworkState TestMachinesNICs.testNICDelete TestMachinesHostDevs.testHostDevicesList TestMachinesFilesystems.testBasicSessionConnection TestMachinesDisks.testDetachDisk TestMachinesCreate.testDisabledCreate TestMachinesCreate.testCreateImportDisk TestMachinesCreate.testCreateDownloadAnOS

The weather report also shows that this fails exclusively on tumbleweed, in more than 77% of runs.

Turning debugging to 11.

I'll give this a little bit of investigation, but only so much -- if I can't figure it out quickly, I'll disable the test for now.

@Nykseli @Lunarequest @SludgeGirl FYI

Lunarequest commented 1 month ago

I actually just ran into this on a machine in our lab. Consistently crashed 3 times before the vm started

martinpitt commented 1 month ago

So here we have debug output, with a broken test log. Locally, I have a working log. I'll do a deep dive after lunch.

martinpitt commented 1 month ago

In the working case, we see the call to Destroy and Undefine followed fairly quickly by the DomainEvent [5,1] and [1,0] signals, which decode to "stopped" and "undefined". In the failing case, these exist as well, but there are two extra events in between:

debug: signal on /org/libvirt/QEMU: org.libvirt.Connect.DomainEvent(["/org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486",5,1]) debug: handle DomainEvent on system: ignoring event AgentEvent debug: signal on /org/libvirt/QEMU: org.libvirt.Connect.DomainEvent(["/org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486",4,0]) debug: signal on /org/libvirt/QEMU: org.libvirt.Connect.DomainEvent(["/org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486",2,0]) debug: libvirt call: system /org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486 org.libvirt.Domain GetXMLDesc [1] {"timeout":30000,"type":"u"} debug: signal on /org/libvirt/QEMU: org.libvirt.Connect.DomainEvent(["/org/libvirt/QEMU/domain/_9c732434_c139_4745_8df4_dd54596e5486",1,0])

which mean (4) "Resumed" and (2) "Started". So these supersede the previous "Stopped" event. This explains the screenshot where the VM is indeed shown as "Running".

So that's a typical race condition which is entirely timing dependent. Testing a fix in commit 6399fdbafcfed310bb609e06dc00279b38c3be73 which waits for the initial state, but it failed.

Testing with a sleep in db28cac4e3401e790be6cb30fbc66c27b4e1620a, just to make sure that this is really it (not a real solution of course). But this is still not enough.

I also tried a "proper" wait:

--- test/check-machines-create
+++ test/check-machines-create
@@ -1342,7 +1342,9 @@ vnc_password= "{vnc_passwd}"
                     user_login = virt_install_cmd_out.split("user-login=", 1)[1].split(",")[0].rstrip()
                     self.assertIn(user_login, self.user_login)

-            time.sleep(5)
+            # wait for virt-install to finish
+            if self.create_and_run:
+                self.machine.execute(f"while {virt_install_cmd}; do sleep 1; done", timeout=300)

         def fill(self):
             b = self.browser

With the default 60s timeout that fails (times out). With 300s timeout it eventually goes away. This is super annoying, but let's see how far it gets: commit 3cf37a485670710f2ab683e177069c72153d876e

martinpitt commented 1 month ago

OK, succeeded . There's something wrong with virt-install on tumbleweed, but I'm running out of steam/time and domain knowledge here. Let's put that in.

Nykseli commented 1 month ago

Thanks @martinpitt for taking a look and figuring it out!

I should have time tomorrow to take a better look at this. Hopefully I get lucky and find what's going on with virt-install :slightly_smiling_face:

martinpitt commented 1 month ago

Awesome -- this now fails everywhere except tumbleweed. So I'll restrict the workaround for now.

this failure is unrelated and also very high on the weather report. But one thing at a time..

martinpitt commented 1 month ago

Meh, this rawhide failure is entirely unrelated and new, also happened in https://github.com/fedora-selinux/selinux-policy/pull/2235:

error: Failed to get libvirt version from the dbus API: {"problem":null,"name":"org.libvirt.Error","message":"Failed to connect socket to '/var/run/libvirt/virtqemud-sock': Permission denied"}

This feels like the same as in https://issues.redhat.com/browse/RHEL-46893 - let's discuss in the selinux-policy PR.

martinpitt commented 1 month ago

Filed https://bugzilla.redhat.com/show_bug.cgi?id=2297965 . I added another commit to disable SELinux in rawhide, similar to our already existing RHEL 10 hack. sigh

mvollmer commented 1 month ago

Awesome -- this now fails everywhere except tumbleweed.

@martinpitt, do you have an idea why?

martinpitt commented 1 month ago

@martinpitt, do you have an idea why?

Unfortunately not. See my debugging notes above for everything else I've tried. The exact inner workings of virt-install are still mysterious to me -- sometimes it keeps running forever, sometimes it finishes, I can't predict it :cry:

So this is indeed just a "throw hands into the air" hack, and I'm sure it'll bite back some day.

Nykseli commented 1 month ago

I tried to look into it but wasn't able to figure out anything either. I guess I'll have to leave it be for now and get back to it if and when it starts breaking again :sweat_smile: