cockpit-project / cockpit

Cockpit is a web-based graphical interface for servers.
http://www.cockpit-project.org/
GNU Lesser General Public License v2.1

testLibvirt (check_machines_dbus.TestMachinesDBus) often fails: #12978

Closed by martinpitt 4 years ago

martinpitt commented 4 years ago

This is one of our most common flakes right now, and pretty hard to get past in our PRs:

Traceback (most recent call last):
  File "test/verify/machineslib.py", line 497, in testLibvirt
    b.wait_in_text("tbody tr[data-row-id=vm-subVmTest1] th", "subVmTest1")
testlib.Error: timeout

martinpitt commented 4 years ago

Also happens on fedora-30, not just on RHEL 8.

martinpitt commented 4 years ago

Another top flake is in testCreate (example on rhel-8-1, example on ubuntu-1804), which at first sight looks similar:

Traceback (most recent call last):
  File "/build/cockpit-project/bots/make-checkout-workdir/test/verify/machineslib.py", line 1522, in testCreate
    start_vm=True,))
  File "/build/cockpit-project/bots/make-checkout-workdir/test/verify/machineslib.py", line 1316, in createTest
    runner.deleteVm(dialog) \
  File "/build/cockpit-project/bots/make-checkout-workdir/test/verify/machineslib.py", line 2259, in checkEnvIsEmpty
    b.wait_in_text("#virtual-machines-listing thead tr td", "No VM is running")
  File "/build/cockpit-project/bots/make-checkout-workdir/test/common/testlib.py", line 392, in wait_in_text
    self.wait_visible(selector)
  File "/build/cockpit-project/bots/make-checkout-workdir/test/common/testlib.py", line 361, in wait_visible
    self.wait_present(selector)
  File "/build/cockpit-project/bots/make-checkout-workdir/test/common/testlib.py", line 352, in wait_present
    self.wait_js_func('ph_is_present', selector)
  File "/build/cockpit-project/bots/make-checkout-workdir/test/common/testlib.py", line 346, in wait_js_func
    self.wait_js_cond("%s(%s)" % (func, ','.join(map(jsquote, args))))
  File "/build/cockpit-project/bots/make-checkout-workdir/test/common/testlib.py", line 343, in wait_js_cond
    self.raise_cdp_exception("timeout\nwait_js_cond", cond, result["exceptionDetails"], trailer)
  File "/build/cockpit-project/bots/make-checkout-workdir/test/common/testlib.py", line 175, in raise_cdp_exception
    raise Error("%s(%s): %s" % (func, arg, msg))
testlib.Error: timeout
wait_js_cond(ph_is_present("#virtual-machines-listing thead tr td"))

I.e. in both cases it waits for the machine to appear in the UI, but the list is empty. If that's something different, let's split this out into a separate issue, of course.

martinpitt commented 4 years ago

I have still seen this on current master. This isn't a problem with the tests; libvirt sometimes seems genuinely slow/stuck.

martinpitt commented 4 years ago

Examples:

skobyda commented 4 years ago

@KKoukiou I don't know how far along you are with the testLibvirt flake, but when I tried to debug it a few days ago, it seemed that after we restart the libvirt service, session resources are loaded immediately, but system resources take too long to load (sometimes several seconds, sometimes tens of seconds), which later results in a timeout.
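
To illustrate why a delay of tens of seconds turns into a test timeout, here is a minimal sketch of a poll-until-ready helper. It is not the code from test/common/testlib.py; the name wait_until and its parameters are hypothetical and only show the general technique of polling a condition against a deadline.

```python
import time


def wait_until(condition, timeout=60.0, delay=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns a truthy value.

    Hypothetical helper for illustration only. Raises TimeoutError if
    the condition is still falsy once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        # Give up only after checking the condition at least once.
        if clock() >= deadline:
            raise TimeoutError(
                "condition not met within %s seconds" % timeout)
        sleep(delay)
```

If the system connection needs longer than the chosen timeout to list its VMs after the service start, a helper like this raises even though the UI would eventually have shown the machine.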

martinpitt commented 4 years ago

@skobyda: To clarify, by "restart libvirt service", do you mean just the system service, i.e. systemctl restart libvirtd (or similar)? The user libvirtd is completely independent from that, so restarting the system service wouldn't affect the user one. You can try restarting the user instance with pkill -ef libvirtd

skobyda commented 4 years ago

@martinpitt In our test we first do systemctl stop libvirtd.service and then start it through the UI button, which calls service.proxy(serviceName).start. Sorry for the confusion, that's what I meant by restart.

skobyda commented 4 years ago

So maybe it's a libvirt-dbus bug. Steps to reproduce (reproduced on ubuntu 19.10):

#systemctl stop libvirtd.service
#systemctl start libvirtd.service
#gdbus call --system --dest org.libvirt --object-path /org/libvirt/QEMU --method org.libvirt.Connect.ListDomains 0

will result in Error: GDBus.Error:org.libvirt.Error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Permission denied. This is the error we get when this test fails, because we cannot call libvirt-dbus APIs to poll VMs. virsh and gdbus call --session work just fine.

Before reporting a bug and providing a patch PR for this flaky test on my next working day, I will ask libvirt to look into this if it's truly a libvirt-dbus bug.

martinpitt commented 4 years ago

@skobyda: That sounds like https://bugs.launchpad.net/ubuntu/+source/libvirt-dbus/+bug/1802005 . test/verify/machineslib.py and check-machines-dbus already have several workarounds for this. However, that bug was only observed on ubuntu-stable so far, and I even believe it's fixed in the recent images. This issue is observed everywhere, also on Fedora and RHEL.

skobyda commented 4 years ago

I'm not saying it's an ubuntu-stable specific bug. Even rhel-8-1 test failures have the "Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory" error message, which could mean that the ubuntu, fedora and rhel failures are caused by the same bug.

martinpitt commented 4 years ago

But "No such file or directory" != "Permission denied". The former rather means that the proxy is not running. Perhaps it crashed?
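
The distinction between the two messages can be sketched as a small classifier. This is only an illustration of the reasoning in this thread; the function name and the returned labels are made up, and the interpretation attached to each label is the one suggested in the comments above, not a confirmed diagnosis.

```python
def classify_libvirt_socket_error(message):
    """Roughly classify a 'Failed to connect socket' error message.

    Hypothetical helper; the labels are illustrative only.
    """
    if "Permission denied" in message:
        # The socket exists, but the caller may not connect to it
        # (the symptom of the ubuntu-stable libvirt-dbus bug).
        return "permission-denied"
    if "No such file or directory" in message:
        # The socket is missing, so the daemon is probably not
        # running (for example because it crashed).
        return "socket-missing"
    return "unknown"


ubuntu_error = ("GDBus.Error:org.libvirt.Error: Failed to connect socket "
                "to '/var/run/libvirt/libvirt-sock': Permission denied")
rhel_error = ("Failed to connect socket to "
              "'/var/run/libvirt/libvirt-sock': No such file or directory")

print(classify_libvirt_socket_error(ubuntu_error))
print(classify_libvirt_socket_error(rhel_error))
```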

skobyda commented 4 years ago

Oh, you are right. It seems that those two problems have different origins. I sent a fix for the ubuntu-stable one: https://github.com/cockpit-project/cockpit/pull/13030

skobyda commented 4 years ago

Figured out what the problem with the testLibvirt flake was: https://github.com/cockpit-project/cockpit/pull/13032. The last flake left to fix out of all the flakes mentioned here is now this, which, judging by the error message and the place where the flake happens, seems unrelated to all the other flakes.

martinpitt commented 4 years ago

I just had another look at the current instance of this flake, which is still happening a lot. Turns out that stopping libvirt causes it to crash:

Process 74933 (libvirtd) of user 0 dumped core.

Stack trace of thread 74938:
#0  0x00007f5f568c54d1 get_lens (libaugeas.so.0)
#1  0x00007f5f568c582b get_lens (libaugeas.so.0)
#2  0x00007f5f568c7663 lns_get (libaugeas.so.0)
#3  0x00007f5f568c10cd lens_get (libaugeas.so.0)
#4  0x00007f5f568c29e5 transform_load (libaugeas.so.0)
#5  0x00007f5f568a13dc aug_load (libaugeas.so.0)
#6  0x00007f5f56902e25 get_augeas (libnetcf.so.1)
#7  0x00007f5f569058d5 list_interface_ids.constprop.0 (libnetcf.so.1)
#8  0x00007f5f56928651 netcfConnectListAllInterfaces (libvirt_driver_interface.so)
#9  0x00007f5f6a45ee3f virConnectListAllInterfaces (libvirt.so.0)
#10 0x0000556e93f6280d remoteDispatchConnectListAllInterfacesHelper (libvirtd)
#11 0x00007f5f6a355c51 virNetServerProgramDispatch (libvirt.so.0)
#12 0x00007f5f6a35b01c virNetServerHandleJob (libvirt.so.0)
#13 0x00007f5f6a27746f virThreadPoolWorker (libvirt.so.0)
#14 0x00007f5f6a27675c virThreadHelper (libvirt.so.0)
#15 0x00007f5f6a0ba4e2 start_thread (libpthread.so.0)
#16 0x00007f5f69f9c6a3 __clone (libc.so.6)

I can't find an existing bug report for that, so this needs to be reported and naughtied or worked around.

martinpitt commented 4 years ago

I filed https://bugzilla.redhat.com/show_bug.cgi?id=1832801 about the crash.

martinpitt commented 4 years ago

Turns out @marusak already reported/naughtied this: https://github.com/cockpit-project/bots/issues/798

But we need to expand the pattern to also catch this variant.
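
As a rough sketch of what expanding the pattern means, assume a known-issue pattern is a glob-style string in which * matches any run of characters, including newlines. The function name, the pattern text, and the log excerpt below are illustrative assumptions, not the actual contents of the bots repository.

```python
import fnmatch


def matches_known_issue(log, pattern):
    """Return True if the log matches the glob-style pattern.

    Hypothetical helper: the pattern is wrapped in '*' on both sides
    so that it may match anywhere inside the log.
    """
    return fnmatch.fnmatchcase(log, "*" + pattern.strip() + "*")


# Illustrative pattern built from frames of the stack trace above.
pattern = ("Process * (libvirtd) of user 0 dumped core.*"
           "get_lens (libaugeas.so.0)*"
           "netcfConnectListAllInterfaces (libvirt_driver_interface.so)")

log = """Process 74933 (libvirtd) of user 0 dumped core.

Stack trace of thread 74938:
#0  0x00007f5f568c54d1 get_lens (libaugeas.so.0)
#8  0x00007f5f56928651 netcfConnectListAllInterfaces (libvirt_driver_interface.so)
#9  0x00007f5f6a45ee3f virConnectListAllInterfaces (libvirt.so.0)
"""

print(matches_known_issue(log, pattern))
```

Keeping only the stable parts of the trace (process name and function names) and replacing PIDs and addresses with wildcards lets one pattern cover several variants of the same crash.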