QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Timeout starting qube on 4.2 that was restored from a 4.1 backup #8656

Open Eric678 opened 12 months ago

Eric678 commented 12 months ago

Qubes OS release

4.2.0-rc3 6.1.43

Brief summary

Transferring a specific app qube from R4.1 to R4.2 results in a qube that will not start. dom0 reports: Cannot connect to qrexec agent for 60 seconds, see ...log. The log shows that the job qubes-relabel-rw.service/start had clocked up 57 seconds before the qube was killed.

Steps to reproduce

I cannot give specific instructions here, as it only happened to one of my qubes when migrating across to 4.2. It is a largish qube, ~30GB (my biggest), so trial and error was slow: change something in /rw, back up, restore ... I did not stumble on the cause. The source template was fedora-37, the destination template fedora-38.

On the first restore, a few persistent block device attachments generated warnings during the restore on 4.2; they would have prevented a start in 4.1. For the first retry I deleted those from dom0 before the backup. Perhaps something was lingering on the destination after the original restored qube was deleted? This was the only qube with persistent attachments that I restored from a full backup.

Expected behavior

Restored qube would start normally.

Actual behavior

As above. While relabel-rw was running, the qube seemed to be in some sort of infinite loop with dom0. It was chalking up 75-95% CPU while dom0 had a couple of short sessions running kcryptd daemons.

My workaround was to tar up all of /rw on the source on 4.1, create a new qube on 4.2 and untar --overwrite on /rw at dest and immediately restart the qube. Much quicker than a backup restore! Worked a treat. Not had any problems with that qube since. Still have the broken qube, if anything useful can be done.
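
The tar-based workaround above can be sketched roughly as below. This is not the exact command sequence that was used: the function names and the archive path are hypothetical, `--overwrite` assumes GNU tar, and how the archive gets from the old system to the new one is left out.

```shell
# Pack the contents of a directory (e.g. /rw in the qube on 4.1) into an archive.
pack_dir() {
    tar -C "$1" -cf "$2" .
}

# Unpack the archive over a directory (e.g. /rw in the fresh qube on 4.2),
# keeping stored permissions and replacing files that already exist there.
unpack_dir() {
    tar -C "$1" --overwrite -xpf "$2"
}

# Usage, as root (the archive path is a placeholder):
#   pack_dir /rw /path/to/rw.tar      # in the source qube on 4.1
#   unpack_dir /rw /path/to/rw.tar    # in the new qube on 4.2, then restart it
```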

euidzero commented 11 months ago

Same here with a qube moved from fedora-37 (installed in 4.1) to fedora-38 template while on 4.2.

euidzero commented 11 months ago

Solution here:

In dom0:

sudo mount /dev/qubes_dom0/vm-MYQUBE-private /mnt
sudo touch /mnt/.autorelabel
sudo umount /mnt

qvm-start MYQUBE

rustybird commented 11 months ago

Not sure what's going on with SELinux here, but:

sudo mount /dev/qubes_dom0/vm-MYQUBE-private /mnt

Never mount a VM volume in dom0. Do it in a DisposableVM instead (ideally based on a disposable template that uses the same TemplateVM as the VM): https://www.qubes-os.org/doc/mount-lvm-image/

macdanny commented 10 months ago

I just experienced this issue when migrating to 4.2.0.

The qube is pretty big, about 350 GB. It was created on 4.1 with a derivative of the fedora-38 template 0:4.0.6-202305200036. By derivative I mean: I leave the out of the box templates alone, I clone them and install additional software in the clone to make them fit for purpose. There's nothing special about the qube.

Since I backed up and restored everything, the template and the qube are both present in the new install. The qube worked with no issues until I switched the template to a new template deriving from fedora-38-xfce 0:4.2.0-202312171103. Then the qube wouldn't start and I saw the relabel messages in the console log.

In my case I ran this to get around the issue:

qvm-prefs --set MYQUBE qrexec_timeout 1200

The qube did eventually start. The 1200 number was arrived at after some trial and error. 5 minutes wasn't enough.

After it started, I restarted it and it subsequently started in a reasonable amount of time. I will reset the qrexec_timeout to the default and carry on.
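
The timeout workaround described above could be scripted in dom0 roughly as follows. This is a minimal sketch: `MYQUBE` is a placeholder, the helper names and the `QVM_PREFS` override variable are hypothetical, and it assumes that `qvm-prefs --default` resets a property to its default value.

```shell
# Hypothetical helpers. QVM_PREFS can be overridden for a dry run,
# e.g. QVM_PREFS=echo prints the arguments instead of changing anything.
QVM_PREFS="${QVM_PREFS:-qvm-prefs}"

# Raise the qrexec timeout (in seconds) so the first boot has time to finish.
raise_qrexec_timeout() {
    "$QVM_PREFS" --set "$1" qrexec_timeout "$2"
}

# Return the timeout to its default once the qube has started successfully.
reset_qrexec_timeout() {
    "$QVM_PREFS" --default "$1" qrexec_timeout
}

# Usage in dom0:
#   raise_qrexec_timeout MYQUBE 1200
#   qvm-start MYQUBE
#   qvm-shutdown --wait MYQUBE
#   reset_qrexec_timeout MYQUBE
```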

Eric678 commented 9 months ago

In my case I was only about 3 seconds short of making it. SELinux was set to permissive until I got around to sorting out printing. There were just under 500K items in the private volume. I guess it is just a bit slow with all items needing labeling.
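
To get a figure like the item count mentioned above, something along these lines could be run inside the qube. It is only a sketch: the function name `count_items` is hypothetical, and `/rw` is assumed to be where the private volume is mounted.

```shell
# Count every file, directory and link under a directory (the directory
# itself included) without crossing into other mounted filesystems.
count_items() {
    find "$1" -xdev -print | wc -l
}

# Usage inside the qube (as root, so nothing is skipped for lack of permission):
#   count_items /rw
```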

Willy-JL commented 2 months ago

I restored VMs from a 4.1.2 install to a fresh 4.2.2 install and they all started correctly except 2 that failed with the same "Cannot connect to qrexec agent for 60 seconds" error. One of them succeeded on the second try, while the other did not succeed even with a 10-minute timeout. Mounting the volume into a temporary VM and creating the missing .autorelabel fixed this issue. I wonder how it happened to go missing? It might be worth noting that after restoring the VMs, I started most of them at nearly the same time, so perhaps it timed out due to processing other ones first, and it could not add the .autorelabel file before the first timeout? Not sure if that makes any sense; regardless, thanks for the solution.

marmarek commented 2 months ago

.autorelabel on the private volume signals that SELinux relabeling was completed. If it's missing, relabeling wasn't completed. For a really big private storage (in number of files, not necessarily bytes) it may take a while, possibly over 10 minutes. If you created it manually without actually completing relabeling, some labels will be missing and you will run into SELinux issues sooner or later. You can do the relabeling manually:

/usr/sbin/restorecon -RF /rw /home /usr/local
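
A check for the marker before relabeling by hand might look like the sketch below, run as root inside the qube. It assumes the marker sits at the top of the private volume (mounted at `/rw`), and the function name `relabel_done` is hypothetical.

```shell
# Succeeds if the .autorelabel marker exists in the given directory, which
# per the comment above means SELinux relabeling was completed.
relabel_done() {
    [ -e "$1/.autorelabel" ]
}

# Usage inside the qube, as root:
#   if ! relabel_done /rw; then
#       /usr/sbin/restorecon -RF /rw /home /usr/local
#   fi
```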