QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Qubes Backup hangs and never finishes if any of the target qubes are running #7411

Open scallyob opened 2 years ago

scallyob commented 2 years ago

Qubes OS release

4.1rc3 and 4.1 final release

Brief summary

Qubes Backup hangs after creating a 30K file if the qube being backed up is still running. If multiple qubes are being backed up and some are running, it creates a larger file (presumably containing the non-running ones) before hanging.

Steps to reproduce

  1. Create a new standalone qube based on fedora-34 called backuptest
  2. Start backuptest
  3. Run Qubes Backup
  4. select just backuptest to be backed up
  5. "Compress backup" is checked
  6. click "Next>"
  7. Select external drive and directory
  8. set password for encryption
  9. uncheck save settings as default backup profile
  10. leave turn computer off unchecked
  11. click "Next>"
  12. it warns you that it will back up the state prior to starting the qube
  13. click the button to start backup
  14. watch the file in terminal

Expected behavior

If I shut down backuptest and run the steps above, I see the progress bar move in the Backup tool until it reaches 100%. This creates a 3.7GB file that I can verify with the Restore Backup tool. I would expect the same results when backuptest is running.

Actual behavior

The progress bar stays at 0% in the Backup tool. In the terminal, the file never exceeds 30K and its timestamp doesn't change after that. The backup never completes, so I have to cancel the Backup tool.
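The stall described above (size stuck at about 30K, timestamp frozen) can be detected with a small polling script instead of watching the file by hand. This is only a sketch: the function names, the polling interval, and the idle threshold are assumptions chosen for illustration, and the path to the backup file has to be supplied by the caller.

```python
import os
import time
from typing import Callable, Optional, Tuple

Snapshot = Tuple[int, float]  # (size in bytes, modification time)


def snapshot(path: str) -> Snapshot:
    """Return the current size and mtime of the backup file."""
    st = os.stat(path)
    return st.st_size, st.st_mtime


def wait_for_stall(
    path: str,
    interval: float = 5.0,
    max_idle_checks: int = 6,
    sleep: Callable[[float], None] = time.sleep,
    max_checks: Optional[int] = None,
) -> bool:
    """Poll `path` and report whether it stopped changing.

    Returns True once size and mtime stay identical for `max_idle_checks`
    consecutive polls, or False if `max_checks` polls pass first.
    """
    last = snapshot(path)
    idle = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(interval)
        checks += 1
        current = snapshot(path)
        # Any change in size or mtime resets the idle counter.
        idle = idle + 1 if current == last else 0
        last = current
        if idle >= max_idle_checks:
            return True
    return False
```

With the defaults, a file that has not changed for roughly 30 seconds is reported as stalled. A healthy backup of a large qube should keep resetting the counter.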

scallyob commented 2 years ago

By the way, I have been working through this for the last couple of months before posting here: https://forum.qubes-os.org/t/backup-fails-since-upgrade-to-4-1-unless-qubes-are-all-shut-down/8660/15

Originally discovered in 4.1rc3 install, but just did a fresh 4.1 install and found it has not been fixed.

rustybird commented 2 years ago

In https://github.com/QubesOS/qubes-issues/issues/7198#issuecomment-1086674714 you wrote that you have an ext4 installation. It's intentional that the Qubes backup system will refuse to back up a running VM stored on a deprecated legacy 'file' driver pool (which was never a safe thing to do).

The unintentional part is that when the backup destination is a VM (instead of dom0) the process will indeed just hang for some reason, instead of aborting with the message "Backup error: file pool cannot export running volumes" as it should.
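The intended behavior (abort with an error rather than hang) amounts to a pre-flight check before any data is written. The sketch below is not the Qubes backup code: `Qube`, `BackupError`, and `check_exportable` are hypothetical names, and the record is a simplified stand-in with only the fields needed here. Appending the qube names to the error message is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


class BackupError(Exception):
    """Raised when a backup must be aborted before any data is written."""


@dataclass
class Qube:
    # Hypothetical, simplified stand-in for a VM record; not the Qubes API.
    name: str
    running: bool
    pool_driver: str  # e.g. "file", "file-reflink", "lvm_thin"


def check_exportable(qubes: Iterable[Qube]) -> List[str]:
    """Return the names that may be backed up.

    Raises BackupError if any running qube is stored on a legacy 'file'
    driver pool, since such a pool cannot export running volumes.
    """
    selected = list(qubes)
    blocked = [q.name for q in selected if q.running and q.pool_driver == "file"]
    if blocked:
        raise BackupError(
            "file pool cannot export running volumes: " + ", ".join(blocked)
        )
    return [q.name for q in selected]
```

Failing early like this would let the Backup tool show the message instead of sitting at 0%.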

You may want to reinstall R4.1 with one of the three supported automatic partitioning schemes:

scallyob commented 2 years ago

I believe I did use the default. I just went through the installer and here is what I remember doing.

When I got to the page entitled "Device Selecting", I selected /dev/sda and /dev/sdb, which I intended to use as a RAID1, then I used "custom" in order to set up the RAID.

The next screen is entitled "New Qubes OS R4.1.0 Installation". I can't think of any reason why I would have changed it from the default of "LVM Thin Provisioning" there. Then I had to delete the old root partition and recreate it: set the type to RAID, check encrypt, select RAID1, then select format as ext4.

When I first attempted installing the release candidate I tried manually setting up the RAID with fdisk and mdadm but couldn't figure it out. The above seemed to work and is what I repeated the other day when doing a fresh install. But from what you've said it seems likely my lack of skills in the RAID install may be the source of my problem? Does what I describe above seem like the wrong way to do it?

DemiMarie commented 2 years ago

> The next screen is entitled "New Qubes OS R4.1.0 Installation". I can't think of any reason why I would have changed it from the default of "LVM Thin Provisioning" there. Then I had to delete the old root partition and recreate it: set the type to RAID, check encrypt, select RAID1, then select format as ext4.

I suspect that is what the installer didn’t handle properly. You were probably hoping that the installer would create a RAID (either using LVM RAID or mdadm) and then do the normal provisioning on top of that. But the manual partitioning probably got it confused and caused it to leave out the LVM layer. That left you stuck with an ext4 filesystem without LVM, so Qubes OS had to resort to the old, crummy file pool. The result is that lots of stuff doesn’t work properly.

(I don’t think I have ever seen a good explanation for why the file pool is no good, so here is mine: The file pool uses Linux’s dm-snapshot driver to provide snapshots. However, dm-snapshot only supports a single origin and a single snapshot, so if one wants to support N revisions, one needs N dm-snapshot layers. Reflinks and LVM thin provisioning, on the other hand, don’t distinguish between origins and snapshots. Both the origin and snapshot are independent of each other, and one can have huge numbers of snapshots with only O(log N) overhead. So it is much more practical to implement nice features this way.)
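To put rough numbers on the comparison above, here is a toy model only. It assumes one stacked layer per revision for the chained approach, and it models the O(log N) overhead as the depth of a balanced lookup tree. Neither function reflects the actual dm-snapshot or thin-provisioning implementations, and the fanout of 2 is an arbitrary assumption.

```python
def chained_layers(revisions: int) -> int:
    """Toy model: one stacked snapshot layer per revision kept."""
    if revisions < 0:
        raise ValueError("revisions must be non-negative")
    return revisions


def tree_depth(entries: int, fanout: int = 2) -> int:
    """Toy model: levels needed for a balanced tree to index `entries` items."""
    if entries < 1 or fanout < 2:
        raise ValueError("need entries >= 1 and fanout >= 2")
    depth = 0
    capacity = 1
    # Each extra level multiplies the number of addressable entries by `fanout`.
    while capacity < entries:
        capacity *= fanout
        depth += 1
    return depth


for n in (10, 100, 1000):
    print(n, chained_layers(n), tree_depth(n))
```

Under these assumptions, keeping 1000 revisions means 1000 stacked layers in the chained model, but only about 10 lookup levels in the tree model.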

> When I first attempted installing the release candidate I tried manually setting up the RAID with fdisk and mdadm but couldn't figure it out. The above seemed to work and is what I repeated the other day when doing a fresh install. But from what you've said it seems likely my lack of skills in the RAID install may be the source of my problem? Does what I describe above seem like the wrong way to do it?

This is an installer bug. At the very least, the installer should emit a giant warning if it cannot create a non-deprecated pool, with the default action being to not continue.

scallyob commented 2 years ago

OK, thanks for the info @DemiMarie. So will I have to wait for a new release of the installer?

Or is there a way to create a RAID setup with the current installer that won't cause this problem?

DemiMarie commented 2 years ago

> OK, thanks for the info @DemiMarie. So will I have to wait for a new release of the installer?

> Or is there a way to create a RAID setup with the current installer that won't cause this problem?

BTRFS’s built-in RAID should work (subject to the usual BTRFS caveats: avoid RAID5 and RAID6, I/O performance can be unstable, etc.), and selecting XFS or BTRFS instead of ext4 will result in a usable varlibqubes pool. If you want to use dm-raid + LVM thin provisioning, I believe you will need to do it manually.
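One way to see why the filesystem choice matters is to probe whether a directory supports reflink copies, which Btrfs and suitably formatted XFS provide and ext4 does not. The sketch below assumes Linux and uses the `FICLONE` ioctl from `<linux/fs.h>`. The function name `supports_reflink` is hypothetical, and this is not necessarily how Qubes OS itself decides which pool driver to use.

```python
import fcntl
import tempfile

# Linux FICLONE ioctl request number (from <linux/fs.h>): _IOW(0x94, 9, int).
FICLONE = 0x40049409


def supports_reflink(directory: str) -> bool:
    """Return True if a file in `directory` can be cloned with FICLONE."""
    with tempfile.NamedTemporaryFile(dir=directory) as src, \
         tempfile.NamedTemporaryFile(dir=directory) as dst:
        # Write enough data that the filesystem stores a regular extent.
        src.write(b"\0" * 65536)
        src.flush()
        try:
            # Ask the kernel to make dst share src's data blocks.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        except OSError:
            # The filesystem (or platform) does not support reflinks here.
            return False
        return True
```

Calling `supports_reflink("/var/lib/qubes")` in dom0 would be the natural check, on the assumption that the varlibqubes pool lives at that path.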

scallyob commented 2 years ago

Reinstalling with BTRFS fixed the backup issue, but it greatly increased the system load and lag. Maybe this is the I/O performance issue you mention? I described it in more detail in the forum, for reference: https://forum.qubes-os.org/t/unable-to-get-fully-functional-system-when-installing-qubes-4-1-on-a-raid1/10682

DemiMarie commented 2 years ago

> Reinstalling with BTRFS fixed the backup issue, but it greatly increased the system load and lag. Maybe this is the I/O performance issue you mention?

It probably is. BTRFS isn’t known for being fast and VM disks are one of the worst-case scenarios for it. You could also try XFS, which might be faster.

> I described it in more detail in the forum, for reference: https://forum.qubes-os.org/t/unable-to-get-fully-functional-system-when-installing-qubes-4-1-on-a-raid1/10682

I will take a look at that in a bit.