QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

ERROR: Logical Volume "vm-sys-<vm-name>-root-snap" already exists in volume group "qubes_dom0" #3964

Open tortugaverde98 opened 6 years ago

tortugaverde98 commented 6 years ago

Qubes 4.0

Affected component(s):

All VMs

Steps to reproduce the behavior:

Boot computer, decrypt, and log in.

Expected behavior:

VMs start as usual, sys-usb, sys-net, sys-firewall, etc.

Actual behavior:

No VMs start. No USB items function. When attempting to start a VM, an error is provided:

ERROR: b' Logical Volume "vm-sys-usb-root-snap" already exists in volume group "qubes_dom0" \n'

Everything appeared normal when I shut it down last night, with the exception that I had to kill the vault VM because Qubes repeatedly prompted me to. That was the first time vault didn't shut down cleanly on its own. Typically that only happens with the VPN VMs, and it has never caused a problem.

I rebooted repeatedly today in hopes it would resolve itself but to no avail.

Any VM with internet access references vm-sys-net-root-snap in the error when attempting to start, presumably because sys-net is the first VM in the chain that must start.

Oddly, VMs without internet access, such as vault, give a different, puzzling error when attempting to start (except sys-usb, which references itself):

ERROR: b" Can't create snapshot vm-sys-whonix-private as origin vm-sys-whonix-private-snap is not suspended.\n Failed to suspend qubes_dom0/pool00 with queued messages.\n"

Why offline VMs should refer to sys-whonix is beyond me.

I lack the skills to troubleshoot this. Any and all help would be extremely appreciated.


tortugaverde98 commented 6 years ago

I ran: sudo lvm

Templates have three items listed: templatename-private, templatename-root, templatename-root-152682xxxx-back

Most app VMs have two items: vmname-private, vmname-15282xxxxx-private

But sys-usb, sys-net, and sys-firewall have five: -private, -private-15282xxxxx-back, -private-snap, -root-snap, -volatile

sys-whonix has four: -private, -private-15282xxxxx-back, -private-snap, -private-tmp

I'm at risk of losing a couple of weeks' worth of intense work. Please advise.

marmarek commented 6 years ago

You can remove the -root-snap volume of any VM that isn't currently running (sys-usb, sys-net, etc.). It is created during VM startup (as a snapshot of the template), but apparently it failed to be removed after the previous shutdown (a crash?).

But first check one more thing: the lvs command output has Data% and Meta% columns. Check whether any volume is full or nearly full, especially in the Meta% column.
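The Meta% check above can be scripted once you have lvs output. A minimal sketch; the three-column "name Data% Meta%" sample lines below are made up for illustration, not real lvs output:

```shell
# Flag any pool whose Meta% (third column) is above 90 in
# lvs-style "name data% meta%" lines. Sample data is illustrative.
printf '%s\n' \
  'pool00 62.11 34.22' \
  'otherpool 99.50 95.10' |
awk '$3 > 90 { print $1 " metadata nearly full (" $3 "%)" }'
```

On a real system the input would come from something like `sudo lvs -o lv_name,data_percent,metadata_percent --noheadings`.
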

tortugaverde98 commented 6 years ago

Data% for pool00 is 62.11; Meta% for pool00 is 34.22.

The data column is populated for all items, but meta has data only for pool00.

I ran: sudo lvremove qubes_dom0 vm-sys-usb-root-snap

After a brief heart attack, I said 'n' to removing the pool(!). Then I said no to every other prompt, watching for vm-sys-usb-root-snap. However, no -root-snap volumes were ever presented in the list of items I had to answer y/n for.

Yet they still show up in sudo lvm.

Another note: when declaring no to the -private-snap volumes for sys-usb and sys-net, instead of the usual response 'logical volume xxx not removed', additional errors were displayed: Can't create snapshot vm-sys-whonix-private as origin vm-sys-whonix-private-snap is not suspended. Failed to suspend qubes_dom0/pool00 with queued messages. Failed to update pool qubes_dom0/pool00.

Is there a way i can attempt to suspend vm-sys-whonix-private-snap? Is that what is holding everything up?

tortugaverde98 commented 6 years ago

The Origin column is blank except for sys-usb, sys-net, sys-firewall, and sys-whonix. For usb, net, and fw, origin shows -private and the template -root.

For sys-whonix, origin shows -snap and two -tmp.

I attempted qvm-shutdown on all four of these VMs, but each gave "Shutdown error: domain is powered off." No VMs are running.

tortugaverde98 commented 6 years ago

I ran: sudo lvremove vm-sys-whonix-private-tmp

But it is still there, referenced twice in the Origin column: once for -private-15268xxxx-back and once for -snap.

However, when I ran it again, it did not appear in the list of items to answer y/n for, yet it is still listed in sudo lvm.

Attempted: qvm-remove sys-whonix
qvm-remove: error: Domain is in use: 'sys-whonix'; details in system log

systemctl | grep sys-whonix
qubes-vm@sys-whonix.service loaded failed failed Start Qubes VM sys-whonix

Rebooted, no change.

marmarek commented 6 years ago

sudo lvremove qubes_dom0 vm-sys-usb-root-snap

Should be lvremove qubes_dom0/vm-sys-usb-root-snap
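The difference matters because of shell word splitting: with a space, lvremove receives two separate targets; the slash form names one LV inside the VG. A tiny sketch using a throwaway function (not part of LVM) to show how each form is parsed:

```shell
# count_args is a hypothetical helper that just reports how many
# arguments it received, standing in for lvremove's argument parsing.
count_args() { echo "$#"; }
count_args qubes_dom0 vm-sys-usb-root-snap   # two targets: prints 2
count_args qubes_dom0/vm-sys-usb-root-snap   # one target: prints 1
```

So the corrected invocation names a single volume: `sudo lvremove qubes_dom0/vm-sys-usb-root-snap`.
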

Can't create snapshot vm-sys-whonix-private as origin vm-sys-whonix-private-snap is not suspended. Failed to suspend qubes_dom0/pool00 with queued messages. Failed to update pool qubes_dom0/pool00.

Is sys-whonix running?

qvm-remove sys-whonix qvm-remove: error: Domain is in use: 'sys-whonix'; details in system log

See the journalctl output for details. Probably it is set as the netvm of anon-whonix and whonix-ws-dvm. Anyway, removing sys-whonix shouldn't be needed to fix your problem.
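A minimal sketch of the checks suggested above, to be run in dom0 (anon-whonix and whonix-ws-dvm are the VM names mentioned in the comment; adjust to your setup):

```shell
# Find the "Domain is in use" details in the current boot's log.
sudo journalctl -b | grep -i sys-whonix

# Show which VMs still reference sys-whonix as their netvm.
qvm-prefs anon-whonix netvm
qvm-prefs whonix-ws-dvm netvm
```
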

tortugaverde98 commented 6 years ago

I ran lvremove qubes_dom0/vm-sys-usb-root-snap, which results in this error: Can't create snapshot vm-sys-whonix-private as origin vm-sys-whonix-private-snap is not suspended. Failed to suspend qubes_dom0/pool00 with queued messages. Failed to update pool qubes_dom0/pool00.

sys-whonix is not running. I also set its autostart to false and rebooted; no change.

journalctl provides a mountain of info. I cannot scroll back far enough to find a root cause.

tortugaverde98 commented 6 years ago

Success. I ran: sudo lvremove qubes_dom0/vm-sys-whonix-private-snap

It deleted successfully. From there, all VMs can start. I rebooted Qubes to be sure; everything comes up cleanly.

subvertising commented 6 years ago

Hello guys. I've hit the same problem. An hour ago I manually shut down my Qubes 4 machine because it froze during work. After rebooting, when I try to run any AppVM I get:

ERROR: b" Can't create snapshot vm-sys-firewall-private-15288xxxxx-back as origin vm-sys-firewall-private is not suspended.\n Failed to suspend qubes_dom0/pool00 with queued messages.\n"

Then I try to start sys-net and see following:

ERROR: b' Logical Volume "vm-sys-net-volatile" already exists in volume group "qubes_dom0"\n'

I don't know how to fix it. I tried your sudo lvremove qubes_dom0/vm-sys-whonix-private-snap, but as far as I can tell (I'm a noob) it is not connected with sys-net and sys-firewall in my case.

Then I tried
sudo lvremove qubes_dom0/vm-sys-firewall-private-snap and sudo lvremove qubes_dom0/vm-sys-firewall-private-15288xxxxx-back, with no success.

Please help me. I can't start any of my AppVMs and this is critical for me.

subvertising commented 6 years ago

When I try sudo lvremove qubes_dom0/vm-sys-net-volatile I receive:

Can't create snapshot vm-sys-firewall-private-15288xxxxx-back as origin vm-sys-firewall-private-snap is not suspended. Failed to suspend qubes_dom0/pool00 with queued messages. Failed to update pool qubes_dom0/pool00.

marmarek commented 5 years ago

Handling of unclean shutdown is already fixed with https://github.com/QubesOS/qubes-core-admin/commit/11c7b4bb512023b76cc9a70c987abceea8e0e785, which is part of qubes-core-dom0 >= 4.0.28.

whatevrr commented 5 years ago

It seems I hit a similar problem on Thursday. I installed the latest dom0 updates (including qubes-manager-4.0.26-1), rebooted, and tried to update all relevant templates using the new update GUI, which finally works after this last update. The update went fine (except for whonix-14, a problem with a fix already described elsewhere). But afterwards, while I was restarting all AppVMs, dom0 stopped responding during startup of my Debian 9 based network VM (sys-net-deb). This used to happen on every start when the NIC was assigned without no-strict-reset and permissive mode specified; that problem was already solved, but I assumed it had perhaps been reintroduced by some setting change from the last update.

I forced a restart (hardware button) and, seeing that sys-net-deb failed to start on Qubes boot, reselected the devices assigned to sys-net-deb and re-enabled no-strict-reset for them in Qube Manager. On the next restart I finally noticed it's not just sys-net-deb that isn't starting.

Now, even after a few more restarts, no VM (except dom0) starts.

I get these errors on each attempted VM startup:

For VMs connected to sys-net-deb through sys-firewall:

Domain sys-net-deb failed to start:
Logical Volume "vm-sys-net-deb-volatile" already exists in volume group "qubes_dom0"

or its vm-sys-net-deb-root-snap or sys-net-deb-private-snap variant.

sys-usb (no network, USB controller assigned) throws similar errors, but referencing itself instead of sys-net-deb.

For most VMs without network VM assigned:

Domain <JustTryingToStartDomainName> has failed to start: device-mapper:message ioctl on (253:3) failed: File exists
 Failed to process thin pool message "create_snap 1666 1649".
 Failed to suspend qubes_dom0/pool00 with queued messages.

Every attempt so far to sudo lvremove qubes_dom0/vm-<SomeName>-volatile (or the -root-snap / -private-snap variants) has failed with:

device-mapper:message ioctl on (253:3) failed: File exists
 Failed to process thin pool message "create_snap 1666 1649".
 Failed to suspend qubes_dom0/pool00 with queued messages.
 Failed to update pool qubes_dom0/pool00.

lvs output for different VMs: (attachments omitted)

tasket commented 5 years ago

@whatevrr @subvertising @marmarek The problem is apparently caused by queued metadata that LVM can't process (perhaps power was lost while a snapshot was in progress).

I encountered a solution here that uses vgcfgbackup and vgcfgrestore like so:

    # vgcfgbackup -f /tmp/pool0-current vg01
    # cp /tmp/pool0-current /tmp/pool0-current-orig   # back up the file before making changes
    # vim /tmp/pool0-current                          # remove the message1 section in vg01 -> logical_volumes -> pool0
    # vgcfgrestore -f /tmp/pool0-current vg01 --force

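The hand edit in the vim step can be sketched mechanically. A hedged example under the assumption (from the comment above) that the stuck metadata is a flat `message1 { ... }` block inside the pool's section; the sample input is illustrative, and on a real system you would run the filter against the vgcfgbackup file after keeping an untouched copy:

```shell
# Delete a flat "message1 { ... }" block from LVM-config-style text.
# Real vgcfgbackup files are indented, so the patterns allow leading
# whitespace. This does not handle nested braces inside the block.
printf '%s\n' \
  'pool0 {' \
  '  message1 {' \
  '    create_snap 1666 1649' \
  '  }' \
  '  transaction_id = 5' \
  '}' |
awk '/^[ \t]*message1 \{/ { skip=1; next }
     skip && /^[ \t]*\}/  { skip=0; next }
     !skip'
```
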
andrewdavidwong commented 5 years ago

I'm also experiencing this bug with qubes-core-dom0-4.0.39-1.fc25.

In my case, it happens without any unexpected shut down or power loss.

andrewdavidwong commented 5 years ago

Here's the journalctl log from the latest occurrence. In my case, the problem is resolved by restarting Qubes.

Apr 23 02:28:50 dom0 qubesd[10730]: Starting sys-whonix
Apr 23 02:28:51 dom0 qubesd[10730]: Starting sys-firewall
Apr 23 02:28:51 dom0 qubesd[10730]: Starting sys-net
Apr 23 02:28:51 dom0 qmemman.systemstate[10723]: do_balloon(memsize=577765376)
Apr 23 02:28:51 dom0 qmemman.systemstate[10723]: xenfree=657469362
Apr 23 02:29:06 dom0 qmemman.daemon.reqhandler[10723]: EOF
Apr 23 02:29:06 dom0 qubesd[10730]: Starting sys-whonix
Apr 23 02:29:06 dom0 qubesd[10730]: Starting sys-firewall
Apr 23 02:29:06 dom0 qubesd[10730]: Starting sys-net
Apr 23 02:29:06 dom0 qmemman.systemstate[10723]: do_balloon(memsize=577765376)
Apr 23 02:29:06 dom0 qmemman.systemstate[10723]: xenfree=657469362
Apr 23 02:29:21 dom0 qmemman.daemon.reqhandler[10723]: EOF
Apr 23 02:29:31 dom0 sudo[1379]:      user : TTY=pts/0 ; PWD=/home/user ; USER=root ; COMMAND=/bin/qubes-dom0-update --clean --check-only
Apr 23 02:29:31 dom0 audit[1379]: USER_CMD pid=1379 uid=1000 auid=1000 ses=2 msg='cwd="/home/user" cmd=71756265732D646F6D302D757064617465202D2D636C65616E202D2D636865636B2D6F6E6C79 terminal=pts/0 res=success'
Apr 23 02:29:31 dom0 kernel: audit: type=1123 audit(1556004571.296:12655): pid=1379 uid=1000 auid=1000 ses=2 msg='cwd="/home/user" cmd=71756265732D646F6D302D757064617465202D2D636C65616E202D2D636865636B2D6F6E6C79 terminal=pts/0 res=success'
Apr 23 02:29:31 dom0 audit[1379]: CRED_REFR pid=1379 uid=0 auid=1000 ses=2 msg='op=PAM:setcred grantors=pam_env,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Apr 23 02:29:31 dom0 kernel: audit: type=1110 audit(1556004571.297:12656): pid=1379 uid=0 auid=1000 ses=2 msg='op=PAM:setcred grantors=pam_env,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Apr 23 02:29:31 dom0 sudo[1379]: pam_systemd(sudo:session): Cannot create session: Already running in a session
Apr 23 02:29:31 dom0 audit[1379]: USER_START pid=1379 uid=0 auid=1000 ses=2 msg='op=PAM:session_open grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_systemd,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Apr 23 02:29:31 dom0 sudo[1379]: pam_unix(sudo:session): session opened for user root by (uid=0)
Apr 23 02:29:31 dom0 kernel: audit: type=1105 audit(1556004571.299:12657): pid=1379 uid=0 auid=1000 ses=2 msg='op=PAM:session_open grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_systemd,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Apr 23 02:29:32 dom0 sudo[1379]: pam_unix(sudo:session): session closed for user root
Apr 23 02:29:32 dom0 audit[1379]: USER_END pid=1379 uid=0 auid=1000 ses=2 msg='op=PAM:session_close grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_systemd,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Apr 23 02:29:32 dom0 audit[1379]: CRED_DISP pid=1379 uid=0 auid=1000 ses=2 msg='op=PAM:setcred grantors=pam_env,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Apr 23 02:29:32 dom0 kernel: audit: type=1106 audit(1556004572.244:12658): pid=1379 uid=0 auid=1000 ses=2 msg='op=PAM:session_close grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_systemd,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Apr 23 02:29:32 dom0 kernel: audit: type=1104 audit(1556004572.244:12659): pid=1379 uid=0 auid=1000 ses=2 msg='op=PAM:setcred grantors=pam_env,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Apr 23 02:29:33 dom0 qubesd[10730]: Starting fedora-29
Apr 23 02:29:33 dom0 qmemman.systemstate[10723]: do_balloon(memsize=426770432)
Apr 23 02:29:33 dom0 qmemman.systemstate[10723]: xenfree=657469362
Apr 23 02:29:33 dom0 qubesd[10730]: Task exception was never retrieved
Apr 23 02:29:33 dom0 qubesd[10730]: future:  exception=StoragePoolException('  Logical volume qubes_dom0/vm-sys-net-root-snap in use.',)>
Apr 23 02:29:33 dom0 qubesd[10730]: Traceback (most recent call last):
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Apr 23 02:29:33 dom0 qubesd[10730]:     result = coro.send(None)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 283, in wrapper
Apr 23 02:29:33 dom0 qubesd[10730]:     return (yield from method(self, *args, **kwargs))
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 681, in stop
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from qubes_lvm_coro(cmd, self.log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 814, in qubes_lvm_coro
Apr 23 02:29:33 dom0 qubesd[10730]:     return _process_lvm_output(p.returncode, out, err, log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 787, in _process_lvm_output
Apr 23 02:29:33 dom0 qubesd[10730]:     raise qubes.storage.StoragePoolException(err)
Apr 23 02:29:33 dom0 qubesd[10730]: qubes.storage.StoragePoolException:   Logical volume qubes_dom0/vm-sys-net-root-snap in use.
Apr 23 02:29:33 dom0 qubesd[10730]: Task exception was never retrieved
Apr 23 02:29:33 dom0 qubesd[10730]: future:  exception=StoragePoolException('  Logical volume qubes_dom0/vm-sys-net-volatile in use.',)>
Apr 23 02:29:33 dom0 qubesd[10730]: Traceback (most recent call last):
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Apr 23 02:29:33 dom0 qubesd[10730]:     result = coro.send(None)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 283, in wrapper
Apr 23 02:29:33 dom0 qubesd[10730]:     return (yield from method(self, *args, **kwargs))
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 684, in stop
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from qubes_lvm_coro(cmd, self.log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 814, in qubes_lvm_coro
Apr 23 02:29:33 dom0 qubesd[10730]:     return _process_lvm_output(p.returncode, out, err, log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 787, in _process_lvm_output
Apr 23 02:29:33 dom0 qubesd[10730]:     raise qubes.storage.StoragePoolException(err)
Apr 23 02:29:33 dom0 qubesd[10730]: qubes.storage.StoragePoolException:   Logical volume qubes_dom0/vm-sys-net-volatile in use.
Apr 23 02:29:33 dom0 qubesd[10730]: Task exception was never retrieved
Apr 23 02:29:33 dom0 qubesd[10730]: future:  exception=StoragePoolException('  Logical Volume "vm-sys-net-root-snap" already exists in volume group "qubes_dom0"',)>
Apr 23 02:29:33 dom0 qubesd[10730]: Traceback (most recent call last):
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Apr 23 02:29:33 dom0 qubesd[10730]:     result = coro.send(None)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 283, in wrapper
Apr 23 02:29:33 dom0 qubesd[10730]:     return (yield from method(self, *args, **kwargs))
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 666, in start
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from self._snapshot()
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 657, in _snapshot
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from qubes_lvm_coro(cmd, self.log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 814, in qubes_lvm_coro
Apr 23 02:29:33 dom0 qubesd[10730]:     return _process_lvm_output(p.returncode, out, err, log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 787, in _process_lvm_output
Apr 23 02:29:33 dom0 qubesd[10730]:     raise qubes.storage.StoragePoolException(err)
Apr 23 02:29:33 dom0 qubesd[10730]: qubes.storage.StoragePoolException:   Logical Volume "vm-sys-net-root-snap" already exists in volume group "qubes_dom0"
Apr 23 02:29:33 dom0 qubesd[10730]: Task exception was never retrieved
Apr 23 02:29:33 dom0 qubesd[10730]: future:  exception=StoragePoolException('  Logical Volume "vm-sys-net-volatile" already exists in volume group "qubes_dom0"',)>
Apr 23 02:29:33 dom0 qubesd[10730]: Traceback (most recent call last):
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Apr 23 02:29:33 dom0 qubesd[10730]:     result = coro.send(None)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 283, in wrapper
Apr 23 02:29:33 dom0 qubesd[10730]:     return (yield from method(self, *args, **kwargs))
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 668, in start
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from self._reset()
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 367, in _reset
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from qubes_lvm_coro(cmd, self.log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 814, in qubes_lvm_coro
Apr 23 02:29:33 dom0 qubesd[10730]:     return _process_lvm_output(p.returncode, out, err, log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 787, in _process_lvm_output
Apr 23 02:29:33 dom0 qubesd[10730]:     raise qubes.storage.StoragePoolException(err)
Apr 23 02:29:33 dom0 qubesd[10730]: qubes.storage.StoragePoolException:   Logical Volume "vm-sys-net-volatile" already exists in volume group "qubes_dom0"
Apr 23 02:29:33 dom0 qubesd[10730]: Task exception was never retrieved
Apr 23 02:29:33 dom0 qubesd[10730]: future:  exception=StoragePoolException('  Logical Volume "vm-sys-net-private-snap" already exists in volume group "qubes_dom0"',)>
Apr 23 02:29:33 dom0 qubesd[10730]: Traceback (most recent call last):
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Apr 23 02:29:33 dom0 qubesd[10730]:     result = coro.send(None)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 283, in wrapper
Apr 23 02:29:33 dom0 qubesd[10730]:     return (yield from method(self, *args, **kwargs))
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 666, in start
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from self._snapshot()
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 657, in _snapshot
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from qubes_lvm_coro(cmd, self.log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 814, in qubes_lvm_coro
Apr 23 02:29:33 dom0 qubesd[10730]:     return _process_lvm_output(p.returncode, out, err, log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 787, in _process_lvm_output
Apr 23 02:29:33 dom0 qubesd[10730]:     raise qubes.storage.StoragePoolException(err)
Apr 23 02:29:33 dom0 qubesd[10730]: qubes.storage.StoragePoolException:   Logical Volume "vm-sys-net-private-snap" already exists in volume group "qubes_dom0"
Apr 23 02:29:33 dom0 qubesd[10730]: Task exception was never retrieved
Apr 23 02:29:33 dom0 qubesd[10730]: future:  exception=StoragePoolException('  Logical Volume "vm-sys-net-root-snap" already exists in volume group "qubes_dom0"',)>
Apr 23 02:29:33 dom0 qubesd[10730]: Traceback (most recent call last):
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Apr 23 02:29:33 dom0 qubesd[10730]:     result = coro.send(None)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 283, in wrapper
Apr 23 02:29:33 dom0 qubesd[10730]:     return (yield from method(self, *args, **kwargs))
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 666, in start
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from self._snapshot()
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 657, in _snapshot
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from qubes_lvm_coro(cmd, self.log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 814, in qubes_lvm_coro
Apr 23 02:29:33 dom0 qubesd[10730]:     return _process_lvm_output(p.returncode, out, err, log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 787, in _process_lvm_output
Apr 23 02:29:33 dom0 qubesd[10730]:     raise qubes.storage.StoragePoolException(err)
Apr 23 02:29:33 dom0 qubesd[10730]: qubes.storage.StoragePoolException:   Logical Volume "vm-sys-net-root-snap" already exists in volume group "qubes_dom0"
Apr 23 02:29:33 dom0 qubesd[10730]: Task exception was never retrieved
Apr 23 02:29:33 dom0 qubesd[10730]: future:  exception=StoragePoolException('  Logical Volume "vm-sys-net-volatile" already exists in volume group "qubes_dom0"',)>
Apr 23 02:29:33 dom0 qubesd[10730]: Traceback (most recent call last):
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Apr 23 02:29:33 dom0 qubesd[10730]:     result = coro.send(None)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 283, in wrapper
Apr 23 02:29:33 dom0 qubesd[10730]:     return (yield from method(self, *args, **kwargs))
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 668, in start
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from self._reset()
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 367, in _reset
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from qubes_lvm_coro(cmd, self.log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 814, in qubes_lvm_coro
Apr 23 02:29:33 dom0 qubesd[10730]:     return _process_lvm_output(p.returncode, out, err, log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 787, in _process_lvm_output
Apr 23 02:29:33 dom0 qubesd[10730]:     raise qubes.storage.StoragePoolException(err)
Apr 23 02:29:33 dom0 qubesd[10730]: qubes.storage.StoragePoolException:   Logical Volume "vm-sys-net-volatile" already exists in volume group "qubes_dom0"
Apr 23 02:29:33 dom0 qubesd[10730]: Task exception was never retrieved
Apr 23 02:29:33 dom0 qubesd[10730]: future:  exception=StoragePoolException('  Logical Volume "vm-sys-net-private-snap" already exists in volume group "qubes_dom0"',)>
Apr 23 02:29:33 dom0 qubesd[10730]: Traceback (most recent call last):
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib64/python3.5/asyncio/tasks.py", line 240, in _step
Apr 23 02:29:33 dom0 qubesd[10730]:     result = coro.send(None)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 283, in wrapper
Apr 23 02:29:33 dom0 qubesd[10730]:     return (yield from method(self, *args, **kwargs))
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 666, in start
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from self._snapshot()
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 657, in _snapshot
Apr 23 02:29:33 dom0 qubesd[10730]:     yield from qubes_lvm_coro(cmd, self.log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 814, in qubes_lvm_coro
Apr 23 02:29:33 dom0 qubesd[10730]:     return _process_lvm_output(p.returncode, out, err, log)
Apr 23 02:29:33 dom0 qubesd[10730]:   File "/usr/lib/python3.5/site-packages/qubes/storage/lvm.py", line 787, in _process_lvm_output
Apr 23 02:29:33 dom0 qubesd[10730]:     raise qubes.storage.StoragePoolException(err)
Apr 23 02:29:33 dom0 qubesd[10730]: qubes.storage.StoragePoolException:   Logical Volume "vm-sys-net-private-snap" already exists in volume group "qubes_dom0"
Apr 23 02:29:33 dom0 dmeventd[2171]: No longer monitoring thin pool qubes_dom0-pool00-tpool.
Apr 23 02:29:33 dom0 lvm[2171]: Monitoring thin pool qubes_dom0-pool00-tpool.
Apr 23 02:29:34 dom0 dmeventd[2171]: No longer monitoring thin pool qubes_dom0-pool00-tpool.
Apr 23 02:29:34 dom0 lvm[2171]: Monitoring thin pool qubes_dom0-pool00-tpool.
andrewdavidwong commented 5 years ago

This seems to be happening more often now. My laptop can't go more than a few days without a reboot before hitting this. Here's what happens when I try lvremove:

$ sudo lvremove -vf qubes_dom0/vm-sys-net-root-snap
Logical volume qubes_dom0/vm-sys-net-root-snap in use.
$ sudo lvremove -vf qubes_dom0/vm-sys-net-private-snap
Logical volume qubes_dom0/vm-sys-net-private-snap in use.

The only workaround I've found is to reboot.

unman commented 5 years ago

On Tue, Jun 11, 2019 at 09:05:28PM -0700, Andrew David Wong wrote:

This seems to be happening more often now. My laptop can't go more than a few days without a reboot before hitting this. Here's what happens when I try lvremove:

$ sudo lvremove -vf qubes_dom0/vm-sys-net-root-snap
Logical volume qubes_dom0/vm-sys-net-root-snap in use.
$ sudo lvremove -vf qubes_dom0/vm-sys-net-private-snap
Logical volume qubes_dom0/vm-sys-net-private-snap in use.

The only workaround I've found is to reboot.

Is that full uptime, or are you sleeping during this time? Are you running "testing"?

marmarek commented 5 years ago

Let's try to find out what is using those volumes. Is there any related VM still visible in xl list? Is there any VM with a (null) name? Is any dom0 process related to this VM still running (ps aux | grep VMNAME)?

andrewdavidwong commented 5 years ago

Is that full uptime, or are you sleeping during this time?

Not full uptime; frequent sleeping periods throughout.

Are you running "testing"?

Just security-testing.

Let's try to find out what is using those volumes. Is there any related VM still visible in xl list? Is there any VM with a (null) name? Is any dom0 process related to this VM still running (ps aux | grep VMNAME)?

I'll check this the next time it happens. Thanks!

andrewdavidwong commented 5 years ago

Let's try to find out what is using those volumes.

It just happened again, so I was able to check this.

Is there any related VM still visible on xl list?

Yes:

Name                                        ID   Mem VCPUs  State   Time(s)
Domain-0                                     0  4080     2     r-----   73330.0
sys-net                                   1981   400     1     --p---       0.0
sys-net-dm                                1982   144     1     -b----       3.1
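The paused state shown above can be spotted mechanically. A small sketch that filters xl-list-style output for domains whose State column contains a p; the sample lines mirror the output quoted here:

```shell
# Print domains whose State column (5th field) contains "p" (paused)
# in xl-list-style output. Sample lines are illustrative.
printf '%s\n' \
  'Domain-0    0     4080  2  r-----  73330.0' \
  'sys-net     1981  400   1  --p---  0.0' \
  'sys-net-dm  1982  144   1  -b----  3.1' |
awk '$5 ~ /p/ { print $1 }'
```

On a live dom0 the same filter would be fed from the real command: `xl list | awk '$5 ~ /p/ { print $1 }'`.
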

Is there any VM with (null) name?

No.

Is any process in dom0 related to this VM still running (ps aux | grep VMNAME)?

Yes:

root     12574  0.0  0.1 160576  6184 ?        SLsl 02:33   0:02 /usr/bin/qubes-guid -N sys-net -c 0xcc0000 -i /usr/share/icons/hicolor/128x128/devices/appvm-red.png -l 1 -q -d 1644 -n -f

I was able to use xl destroy to destroy sys-net. This allowed me to start sys-net normally with Qubes tools. It connected to my Wi-Fi network and worked normally. Other VMs that depend on sys-net were able to start normally after that.

However, after using xl destroy on sys-net, I was not able to shut down any VMs normally with Qubes tools. (They would go down to 0 MB in the Qubes Domains widget but not fully halt. The "Shutdown" button was grayed out in Qube Manager. Starting the VM again would halt it, then restart it.) So, I had to do a full reboot of the whole machine anyway.

marmarek commented 5 years ago

sys-net      1981   400     1     --p---       0.0
sys-net-dm   1982   144     1     -b----       3.1

Ok, so sys-net is paused, not powered off. It isn't surprising that starting it again fails: even if it didn't hit LVM first, it would fail later because the domain is already started. There are two situations where VMs are paused in Qubes:

But neither of those situations should allow the same VM to be started again. I've tried to reproduce a similar thing manually, and in all cases I got a proper "domain ... is already running" error when trying to start such a paused VM. Can you provide more details about what actions lead to this error? Check also whether you see any errors related to libvirtd in journalctl (sudo journalctl -u libvirtd) between the last successful VM start and the failed one.

So, I see two (probably related) issues here:

As for xl destroy, use virsh -c xen destroy, details here.

andrewdavidwong commented 5 years ago

Can you provide more details what actions leads to this error?

It pretty reliably starts happening after ~4 days of uptime. I just use the system normally: starting and stopping VMs, suspending and resuming a few times a day, updating TemplateVMs, shutting all VMs down and restarting them, etc. One non-default thing about my setup is that I use "named"/"static" DisposableVMs for sys-net and sys-firewall.

Check whether you see any errors related to libvirtd in the journal (sudo journalctl -u libvirtd) between the last successful VM start and the failed one.

Ok, I'll check this the next time it happens.

As for xl destroy, use virsh -c xen destroy

Will do, thanks. Should I expect to be able to use this in order to destroy sys-net when it gets stuck as a workaround to avoid having to reboot the whole system?

marmarek commented 5 years ago

It pretty reliably starts happening after ~4 days of uptime.

Does it happen right after suspend? Or maybe while restarting all the VMs (including sys-net)?

Should I expect to be able to use this in order to destroy sys-net when it gets stuck as a workaround to avoid having to reboot the whole system?

Yes.

andrewdavidwong commented 5 years ago

Does it happen right after suspend?

It doesn't seem related to suspending. Perhaps it sometimes happens right after resuming from suspend, but certainly not only after resuming from suspend.

Or maybe while restarting all the VMs (including sys-net)?

Yes, it usually seems to happen when I shut down all VMs including sys-net, then try to start some back up. (Well, I usually shut down all other VMs when I want to restart sys-net anyway.)

andrewdavidwong commented 5 years ago

As for xl destroy, use virsh -c xen destroy

Unfortunately, this did not work for me either:

[user@dom0 ~]$ xl list
Name                                        ID   Mem VCPUs  State   Time(s)
Domain-0                                     0  4080     2     r-----  101097.1
sys-net                                   1953   400     1     --p---       0.0
sys-net-dm                                1954   144     1     -b----     314.5
[user@dom0 ~]$ virsh -c xen destroy sys-net
error: Failed to destroy domain sys-net
error: Requested operation is not valid: Domain is not running

[user@dom0 ~]$ virsh -c xen destroy sys-net-dm
error: failed to get domain 'sys-net-dm'
error: Domain not found

[user@dom0 ~]$ xl list
Name                                        ID   Mem VCPUs  State   Time(s)
Domain-0                                     0  4080     2     r-----  101098.5
sys-net                                   1953   400     1     --p---       0.0
sys-net-dm                                1954   144     1     -b----     314.7

marmarek commented 5 years ago

Check /var/log/libvirt/libxl/libxl-driver.log around the time of the sys-net start that led to this state (i.e., the failed one, before the message from this issue's title). Do you see anything similar to https://github.com/QubesOS/qubes-issues/issues/3322 ?
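
A small sketch for narrowing that log down to the day of the failed start. The sample lines are fabricated stand-ins; in dom0 you would point awk at the real /var/log/libvirt/libxl/libxl-driver.log:

```shell
# Sketch: keep only libxl-driver.log lines from a given date.
# "log" holds fabricated sample lines; in dom0 you would run:
#   awk -v day="2019-06-22" '$1 == day' /var/log/libvirt/libxl/libxl-driver.log
log='2019-06-21 03:28:16.834+0000: libxl: first sample line
2019-06-22 09:55:51.455+0000: libxl: second sample line'

# The date is the first whitespace-separated field of each line
echo "$log" | awk -v day="2019-06-22" '$1 == day'
```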

andrewdavidwong commented 5 years ago

Check /var/log/libvirt/libxl/libxl-driver.log around the time of the sys-net start that led to this state (i.e., the failed one, before the message from this issue's title). Do you see anything similar to #3322 ?

The log is no longer available when it last occurred (probably because I had to reboot), but I'll check this the next time it happens.

marmarek commented 5 years ago

Not even rotated one (with date suffix)?

tasket commented 5 years ago

FWIW, the problem has not returned for me since February. I don't use named DisposableVMs, however, and my suspend activity is lighter than Andrew's. My uptime can vary a lot, but it rarely exceeds 5 days.

I have a few troubleshooting recommendations for Andrew:

  1. If you revert back to normal sys-net and sys-firewall, does the error persist?

  2. Does using vgcfgrestore help?

  3. Try hardware/RAM diagnostics. Then follow that up with something like rpm -Va to verify all package contents in the system (maybe restrict this to qubes* packages at first).

tasket commented 5 years ago

I also meant to include this:

  1. Thin pool health: In lvs output, is the Meta% for pool00 high? Also try booting from a rescue disk, unlocking your system partition, and then running thin_check /dev/mapper/qubes_dom0-pool00_tmeta.
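
A sketch of that Meta% check (assuming the default Qubes pool name qubes_dom0/pool00 and an arbitrary 90% warning threshold). "sample" fakes the command output so the filter can be shown; in dom0 you would feed it real output:

```shell
# Sketch of the thin-pool metadata check. "sample" stands in for:
#   sudo lvs --noheadings -o lv_name,metadata_percent qubes_dom0/pool00
# The 90% threshold is an illustrative choice, not an official limit.
sample='  pool00  47.23'

echo "$sample" | awk '{
  if ($2 + 0 >= 90) print $1 ": metadata " $2 "% - critically full"
  else              print $1 ": metadata " $2 "% - looks OK"
}'
```

Note that thin_check must be run against an inactive pool, hence the rescue-disk suggestion above.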
andrewdavidwong commented 5 years ago

Not even rotated one (with date suffix)?

Ah, indeed. I don't remember exactly when it happened, but here are all the unusual lines:

2019-06-21 03:28:16.834+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/1535/backend/vif/1540/0
2019-06-21 03:28:16.845+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 1540
2019-06-21 03:29:02.142+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2019-06-21 03:29:02.345+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2019-06-21 14:30:29.833+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2019-06-21 14:30:30.063+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2019-06-21 14:46:41.539+0000: libxl: libxl.c:1853:libxl_console_get_tty: unable to read console tty path `/local/domain/1939/console/tty': Resource temporarily unavailable
2019-06-21 23:33:08.859+0000: libxl: libxl.c:1853:libxl_console_get_tty: unable to read console tty path `/local/domain/1946/console/tty': Resource temporarily unavailable
2019-06-22 09:55:51.455+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/1938/backend/vif/1942/0
2019-06-22 09:55:51.461+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 1942
2019-06-22 09:55:52.903+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/1938/backend/vif/1944/0
2019-06-22 09:55:52.911+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 1944
2019-06-22 09:56:40.245+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2019-06-22 09:56:40.472+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2019-06-22 20:39:16.401+0000: libxl: libxl_pci.c:1176:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2019-06-22 20:40:33.078+0000: libxl: libxl_pci.c:1176:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2019-06-22 20:40:33.200+0000: libxl: libxl_pci.c:1494:do_pci_remove: xc_physdev_unmap_pirq irq=16: Invalid argument
2019-06-23 06:59:17.419+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/5/backend/vif/36/0
2019-06-23 06:59:17.455+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 36
2019-06-23 06:59:19.200+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/5/backend/vif/32/0
2019-06-23 06:59:19.204+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 32
2019-06-23 06:59:20.475+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/5/backend/vif/41/0
2019-06-23 06:59:20.485+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 41
2019-06-23 06:59:20.758+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/5/backend/vif/38/0
2019-06-23 06:59:20.766+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 38
2019-06-23 07:00:13.674+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2019-06-23 07:00:13.756+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address

andrewdavidwong commented 5 years ago

I have a few troubleshooting recommendations for Andrew: [...]

Thanks for the tips, Chris! I'll try these next if the logs above don't reveal anything.

Thin pool health: In lvs output, is the Meta% for pool00 high?

It's around 47%, so it appears to be fine.

andrewdavidwong commented 5 years ago

Check /var/log/libvirt/libxl/libxl-driver.log about the time of sys-net start that lead to this state (so, the failed one, before the message from this issue title). Do you see anything similar to #3322 ?

I've been monitoring /var/log/libvirt/libxl/libxl-driver.log when this problem occurs. I do not see anything similar to #3322.

At the time when sys-net fails to start, I just see lines like this:

2019-06-28 09:40:27.689+0000: libxl: libxl.c:1853:libxl_console_get_tty: unable to read console tty path `/local/domain/1960/console/tty': Resource temporarily unavailable
2019-06-28 09:42:30.240+0000: libxl: libxl.c:1853:libxl_console_get_tty: unable to read console tty path `/local/domain/1961/console/tty': Resource temporarily unavailable
2019-06-28 09:43:57.060+0000: libxl: libxl.c:1853:libxl_console_get_tty: unable to read console tty path `/local/domain/1962/console/tty': Resource temporarily unavailable

One odd thing about this bug is that it consistently occurs at ~5 days of uptime. If I just always make sure to reboot dom0 before reaching ~5 days of uptime, I can consistently avoid the bug.
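
A hedged sketch of that workaround (my own helper, not a Qubes tool): warn before dom0 uptime reaches the ~5-day mark. uptime_days converts the first field of a /proc/uptime-style line ("uptime-seconds idle-seconds") into whole days:

```shell
# Hypothetical helper: read an "uptime-seconds idle-seconds" line on
# stdin (the format of /proc/uptime) and print whole days of uptime.
uptime_days() {
  awk '{ print int($1 / 86400) }'
}

days=$(uptime_days < /proc/uptime)
if [ "$days" -ge 4 ]; then
  echo "dom0 up ${days}+ days - consider rebooting before the bug triggers"
fi
```

Something like this could run from a login script or cron job as a reminder.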

andrewdavidwong commented 5 years ago

Occurred again after 4d22h of uptime. This time I caught the initial error message about being unable to start sys-net:

Cannot recv data: Connection reset by peer

This led me to #990, in which @marmarek said to check /var/log/messages. Unfortunately, mine was empty.

Still nothing different in /var/log/libvirt/libxl/libxl-driver.log (same as posted above).

I don't think it's a hardware problem due to the 5d uptime consistency. I'm going to try reverting back to normal sys-net and sys-firewall now.

marmarek commented 5 years ago

Check libvirt related errors in journalctl

andrewdavidwong commented 5 years ago

Check libvirt related errors in journalctl

Yeah, I'm seeing a lot of libvirt errors in here:

journalctl.txt

marmarek commented 5 years ago

Something went really wrong there:

Jul 13 17:19:54 dom0 kernel: libvirtd[3341]: segfault at b3 ip 00000000000000b3 sp 00007ffea31f99a0 error 14 in libvirtd[5605bb8ea000+98000]
Jul 13 17:19:54 dom0 kernel: Code: Bad RIP value.
Jul 13 17:19:54 dom0 systemd[1]: Started Process Core Dump (PID 25968/UID 0).
Jul 13 17:19:55 dom0 systemd[1]: libvirtd.service: Main process exited, code=dumped, status=11/SEGV
Jul 13 17:19:55 dom0 systemd[1]: libvirtd.service: Unit entered failed state.
Jul 13 17:19:55 dom0 systemd[1]: libvirtd.service: Failed with result 'core-dump'.
Jul 13 17:19:56 dom0 systemd-coredump[25969]: Process 3341 (libvirtd) of user 0 dumped core.

                                              Stack trace of thread 3341:
                                              #0  0x00000000000000b3 n/a (n/a)

Looks like some memory corruption bug within libvirt. I don't have high hopes, but could you try to extract a full backtrace using the coredumpctl info command?
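
The command marmarek asks for, guarded so it degrades gracefully outside dom0. For a symbolized backtrace you would additionally need the libvirt debuginfo packages installed, after which `coredumpctl gdb /usr/sbin/libvirtd` drops you into gdb on the core:

```shell
# Show the most recent libvirtd core dump, if any. Guarded so this is
# informative rather than an error on systems without systemd-coredump.
out=$(
  if command -v coredumpctl >/dev/null 2>&1; then
    coredumpctl info /usr/sbin/libvirtd 2>/dev/null || echo "no libvirtd core dumps recorded"
  else
    echo "coredumpctl not available on this system"
  fi
)
echo "$out"
```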

andrewdavidwong commented 5 years ago

Looks like some memory corruption bug within libvirt. I don't have high hopes, but could you try to extract a full backtrace using the coredumpctl info command?

There is indeed a coredumpctl info entry at the same timestamp:

           PID: 3341 (libvirtd)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Sat 2019-07-13 17:19:54 CDT (10h ago)
  Command Line: /usr/sbin/libvirtd
    Executable: /usr/sbin/libvirtd
 Control Group: /system.slice/libvirtd.service
          Unit: libvirtd.service
         Slice: system.slice
       Boot ID: e0e01a54ec7a41c4a893029e6e7b5864
    Machine ID: c23c70a11dcc4184b8ea44ab38623af3
      Hostname: dom0
      Coredump: /var/lib/systemd/coredump/core.libvirtd.0.e0e01a54ec7a41c4a893029e6e7b5864.3341.1563056394000000000000.lz4
       Message: Process 3341 (libvirtd) of user 0 dumped core.

                Stack trace of thread 3341:
                #0  0x00000000000000b3 n/a (n/a)

But it doesn't look like this includes a full backtrace.

andrewdavidwong commented 5 years ago

I'm going to try reverting back to normal sys-net and sys-firewall now.

This made no noticeable difference. Still experiencing the problem the same as before.

andrewdavidwong commented 5 years ago

Then follow that up with something like rpm -Va to verify all package contents in the system (maybe restrict this to qubes* packages at first).

Should I be concerned about the missing entries or the L (link mismatch) entries?

$ sudo rpm -Va
.M....GT.    /var/log/salt/minion
..5....T.    /usr/share/themes/Ops/xfwm4/menu-active.xpm
..5....T.    /usr/share/themes/Ops/xfwm4/menu-inactive.xpm
S.5....T.    /usr/share/themes/Ops/xfwm4/themerc
S.5....T.  c /etc/yum.conf
....L....    /usr/bin/X
......G..  c /etc/qubes-rpc/policy/whonix.GatewayCommand
......G..  c /etc/qubes-rpc/policy/whonix.NewStatus
......G..  c /etc/qubes-rpc/policy/whonix.SdwdateStatus
S.5....T.  c /etc/qubes-rpc/policy/qubes.InputMouse
S.5....T.    /etc/xdg/xfce4/xfconf/xfce-perchannel-xml/xfce4-session.xml
.......T.    /lib/modules/4.19.43-1.pvops.qubes.x86_64/modules.builtin.bin
.......T.    /lib/modules/4.19.43-1.pvops.qubes.x86_64/modules.devname
.......T.    /lib/modules/4.19.43-1.pvops.qubes.x86_64/modules.softdep
SM5....T.  c /etc/qubes-rpc/policy/qubes.Filecopy
.......T.  c /etc/qubes-rpc/policy/qubes.GetDate
S.5....T.  c /etc/qubes-rpc/policy/qubes.OpenInVM
S.5....T.  c /etc/qubes-rpc/policy/qubes.OpenURL
S.5....T.  c /etc/qubes-rpc/policy/qubes.StartApp
S.5....T.  c /etc/qubes-rpc/policy/qubes.UpdatesProxy
.M.....T.  c /etc/qubes-rpc/policy/qubes.VMRootShell
SM5....T.  c /etc/qubes-rpc/policy/qubes.VMShell
S.5....T.  c /etc/qubes/qmemman.conf
.M.......    /var/run/qubes
.M....G..    /etc/xen
S.5....T.  c /etc/xen/xl.conf
.M....G..    /var/lib/xen
.M....G..    /var/log/xen
.M....G..    /var/log/xen/console
S.5....T.    /etc/xdg/xfce4/xfconf/xfce-perchannel-xml/xfce4-power-manager.xml
.M.......  c /etc/grub.d/10_linux
S.5....T.  c /etc/plymouth/plymouthd.conf
....L....  c /etc/pam.d/fingerprint-auth
....L....  c /etc/pam.d/password-auth
....L....  c /etc/pam.d/postlogin
....L....  c /etc/pam.d/smartcard-auth
....L....  c /etc/pam.d/system-auth
S.5....T.    /usr/lib/python3.5/site-packages/urllib3/packages/__pycache__/six.cpython-35.pyc
.......T.  c /etc/PackageKit/PackageKit.conf
.......T.    /lib/modules/4.14.103-1.pvops.qubes.x86_64/modules.builtin.bin
.......T.    /lib/modules/4.14.103-1.pvops.qubes.x86_64/modules.devname
.......T.    /lib/modules/4.14.103-1.pvops.qubes.x86_64/modules.softdep
S.5....T.  c /etc/xdg/xfce4/xfconf/xfce-perchannel-xml/xfce4-keyboard-shortcuts.xml
S.5....T.  c /etc/xdg/xfce4/panel/default.xml
S.5....T.  c /etc/qubes-rpc/policy/qubes.Gpg
S.5....T.  c /etc/qubes-rpc/policy/qubes.GpgImportKey
..5....T.  c /etc/yum.repos.d/qubes-dom0.repo
S.5....T.  c /etc/xdg/xfce4/xfconf/xfce-perchannel-xml/xsettings.xml
..5....T.    /usr/share/themes/Xfce-dusk/gtk-2.0/gtkrc
S.5....T.  c /etc/dnf/dnf.conf
.......T.    /lib/modules/4.14.116-1.pvops.qubes.x86_64/modules.builtin.bin
.......T.    /lib/modules/4.14.116-1.pvops.qubes.x86_64/modules.devname
.......T.    /lib/modules/4.14.116-1.pvops.qubes.x86_64/modules.softdep
S.5....T.  c /etc/xscreensaver/XScreenSaver.ad.tail
missing     /var/run/pulse
S.5....T.  c /etc/qubes-rpc/policy/qubes.ClipboardPaste
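
For readers decoding the output above: each column of an `rpm -V` status string flags one failed check, and "." means that check passed. A hypothetical helper (decode_rpm_verify is my own name, not an rpm tool) that expands such a string, using the flag positions documented in rpm(8):

```shell
# Hypothetical helper (not part of rpm): expand an `rpm -V` status
# string such as "S.5....T." into the names of the checks that failed.
# Positions, per rpm(8): S size, M mode, 5 digest, D device, L readlink,
# U user, G group, T mtime, P capabilities.
decode_rpm_verify() {
  s=$1
  out=""
  i=1
  for name in size mode digest device link user group mtime capabilities; do
    c=$(printf '%s' "$s" | cut -c "$i")
    if [ -n "$c" ] && [ "$c" != "." ] && [ "$c" != "?" ]; then
      out="$out $name"
    fi
    i=$((i + 1))
  done
  echo "${out# }"
}

decode_rpm_verify "S.5....T."   # size digest mtime
```

So "S.5....T." above means the file's size, digest, and modification time differ from what the package installed, which is expected for edited config files (the "c" marker).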
andrewdavidwong commented 5 years ago

It now appears that this problem occurs as a result of activity, not time. This week, my pattern of Qubes usage has been different from usual. I'm at over 7 days of uptime, yet the problem has not occurred. My usual pattern of activity includes a lot of suspending and resuming, as well as a lot of shutting down and restarting of VMs, especially at night during scripted backups and template updates. Since I haven't been doing any of this as much over the past week, and the problem has not occurred by now, I suspect that there is something about my usual pattern of activity that triggers the problem.

DemiMarie commented 3 years ago

I wonder if this has been fixed in a newer version of libvirtd.

andrewdavidwong commented 3 years ago

I wonder if this has been fixed in a newer version of libvirtd.

I can check and report back. Which version should I have for a good test?

DemiMarie commented 3 years ago

No idea, tbh.

andrewdavidwong commented 3 years ago

Ok, I'll just keep an eye on it in general.

DemiMarie commented 3 years ago

Can you reproduce this in R4.1 with a Fedora 32 dom0?

andrewdavidwong commented 3 years ago

Can you reproduce this in R4.1 with a Fedora 32 dom0?

Sorry, I don't have a machine to test 4.1, but perhaps someone else who has experienced the issue can. Or I can report back once 4.1 is stable and I've migrated to it, but that might not be for a while.

andrewdavidwong commented 3 years ago

I can at least confirm that this is still happening on R4.0 with libvirtd 3.3.0.

DemiMarie commented 2 years ago

The original problem should have been fixed by QubesOS/qubes-core-admin#397. @andrewdavidwong if you still experience libvirtd crashes, please open a separate issue for them.

maxwell-kalin commented 2 years ago

Still happens to me on 4.1; can't use sys-usb.