Open ddevz opened 2 years ago
Does backup work if all of the VMs you are trying to back up are stopped? That should not be necessary, but it might point at the cause of the problem.
I'll try it
Does backup work if all of the VMs you are trying to back up are stopped? That should not be necessary, but it might point at the cause of the problem.
Tried it. No change.
Is there a log of what has been backed up? I tried doing a "verify" from the restore tool to see what got backed up and what did not, in order to get a hint at which VM was causing it, but that led to a new, larger concern:
New Even Larger Concern: Doing a verify says everything backed up fine, even images which couldn't possibly compress down to 15 Gigs. This could be a problem for someone who is using "verify" from the restore tool with the intention of making sure their backups are there, as they will be surprised when they actually try to restore from them.
Raising priority as it causes data loss.
My understanding is that the backup verification tool checks only for errors during the restore process; it does not check for missing data. @marmarek, is this accurate?
The verification tool should check for truncated VM images at least.
First time: 15570100 backed up (found with ls -s)
Second time: 15676416 backed up (found with ls -s)
Third time: 15676416 backed up (found with ls -s)
This used to work and write 183G of files.
After it hangs, the disk activity stops.
Did you recently significantly increase some VM's storage capacity? Backing up very sparse data (e.g. a VM with hundreds of gigabytes of free space) from LVM will cause the backup system to spend most of its time seemingly doing nothing: https://forum.qubes-os.org/t/solved-backup-qubes-tarwriter-extremly-slow/7456
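The sparse-data effect described in that forum thread is easy to demonstrate with a throwaway file (a sketch; sparse.img is just an example name): the apparent size and the allocated size diverge wildly, yet a byte-by-byte reader still has to scan the full apparent size.

```shell
# Create a file with a 10 GiB apparent size but (almost) no allocated blocks.
truncate -s 10G sparse.img

stat -c 'apparent: %s bytes'   sparse.img   # reports 10737418240
stat -c 'allocated: %b blocks' sparse.img   # reports a near-zero block count

rm sparse.img
```

This is the same divergence ls -lh (apparent) versus ls -s (allocated) shows on the backup files.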
That would explain it!!!! This is definitely a (UX, at least) bug.
Did you recently significantly increase some VM's storage capacity? Backing up very sparse data (e.g. a VM with hundreds of gigabytes of free space) from LVM will cause the backup system to spend most of its time seemingly doing nothing: https://forum.qubes-os.org/t/solved-backup-qubes-tarwriter-extremly-slow/7456
Yes, I did. After your post, I removed everything that I believe had sparse filesystems, and it still only got to 15665920. Running top shows no resource consumption, and ps does not show any tarwriter process in the listing.
Is there any way to figure out what VM it's hanging on? I'm really hoping for a log somewhere.
I found qvm-backup, so to try to debug it, instead of using the GUI, I did:
cp /etc/qubes/backup/qubes-manager-backup.conf backup-service-vms.conf
Then removed a bunch of vms from the list, then:
qvm-backup --verbose --profile ./backup-service-vms.conf
and immediately got:
Backup preparation error: Got empty response from qubesd. See journalctl in dom0 for details.
Any ideas? Obviously this is not the cause of the original problem, as it happens instantly, whereas the problem gets 15G into the backup, but this backup preparation error is preventing me from figuring out what's going on.
journalctl | grep -i 'backup'
gives:
dom0 qubesd[3291]: permission denied for call b'admin.backup.Info'+b'./backup-service-vms.conf' (b'dom0' → b'dom0') with payload of 0 bytes
The backup profile must be passed as the basename of a .conf file under /etc/qubes/backup/, e.g.:
cp /etc/qubes/backup/qubes-manager-backup.conf /etc/qubes/backup/foo.conf
qvm-backup --verbose --profile foo
But qvm-backup probably won't show anything relevant even with --verbose, because everything interesting happens within the qubesd daemon. You'd have to enable debug logging for this daemon and then do another backup run:
Run sudo systemctl edit --runtime --full qubesd, add the --debug argument after /usr/bin/qubesd, save the file, and quit. Then restart the daemon with systemctl restart qubesd and keep journalctl -ft qubesd -g backup running in a terminal window.
Results:
Most backups seem to look like this:
Jun 21 11:23:05 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:657] qubes.backup: python3 -m qubes.tarwriter --use-compress-program=gzip --override-name=vm19/private.img /dev/qubes_dom0/vm-sys-backup-private
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._split_and_send:593] qubes.backup: handle_streams returned: None
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._split_and_send:601] qubes.backup: scrypt return code: 0
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:612] qubes.backup: Backing up <qubes.backup.Backup.FileToBackup object at 0x730d73c6d1c0>
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:617] qubes.backup: Using temporary location: /tmp/tmpgsgz55e4/vm19/firewall.xml
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:657] qubes.backup: tar -Pc --sparse -C /var/lib/qubes/appvms/sys-backup --dereference --use-compress-program=gzip --xform=s:^firewall.xml:vm19/\0: firewall.xml
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup.run:141] qubes.backup: Sending file vm19/private.img.000.enc
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup.run:159] qubes.backup: Removing file vm19/private.img.000.enc
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._split_and_send:593] qubes.backup: handle_streams returned: None
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup._split_and_send:601] qubes.backup: scrypt return code: 0
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:612] qubes.backup: Backing up <qubes.backup.Backup.FileToBackup object at 0x730d73c6d3a0>
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:617] qubes.backup: Using temporary location: /tmp/tmpgsgz55e4/vm20/private.img
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup.run:141] qubes.backup: Sending file vm19/firewall.xml.000.enc
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup.run:159] qubes.backup: Removing file vm19/firewall.xml.000.enc
However, the final backup when the hang happened, looked like this:
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:657] qubes.backup: python3 -m qubes.tarwriter --use-compress-program=gzip --override-name=vm20/private.img /dev/qubes_dom0/vm-qubename-private
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup._split_and_send:593] qubes.backup: handle_streams returned: None
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup._split_and_send:601] qubes.backup: scrypt return code: 0
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup.run:141] qubes.backup: Sending file vm20/private.img.000.enc
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup.run:159] qubes.backup: Removing file vm20/private.img.000.enc
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup.run:162] qubes.backup: Finished sending thread
And then just hangs there.
Now, to be fair, the system it hangs on has 28G of incompressible data (and 7G of free space) on the private volume.
Note that the "handle_streams returned" line showed up before the "Sending file" and "Removing file" lines for vm20/private.img, as opposed to vm19 and some others I checked.
Also, I see that it split the files for some of the larger templates, so with 28G of data I would expect it to be trying to split this one too.
Also, I left the string "vm-sys-backup-private" in to draw attention to the fact that the backup system backed up the backup qube (the qube the backup data passes through on its way to being saved) right before it started on vm20 (in case that is relevant).
I removed that one VM from the backup list and the backup worked. I can probably fix that VM by creating a new appVM, qvm-copying the files over to it, and erasing the original appVM. Assuming that process works, is there any reason to keep debugging the original appVM to uncover what the actual problem was?
Interestingly, that appVM is one I had backed up before, and to the best of my knowledge haven't touched since the last backup.
Now, to be fair, the system it hangs on has 28G of incompressible data
Just to confirm, you already tried creating the backup without compression, right? (That's always the first thing I try when there's a problem backing up VMs containing large amounts of incompressible data.)
@andrewdavidwong:
My understanding is that the backup verification tool checks only for errors during the restore process; it does not check for missing data. @marmarek, is this accurate?
@DemiMarie:
The verification tool should check for truncated VM images at least.
Ah, here's the issue I was thinking of:
is there any reason to debug the original appVM to uncover what the actual problem was?
If you have enough disk space you could keep that VM around, in case someone comes up with an idea how to find the bug.
One possibility would be to attach a Python debugger to qubesd, but a simpler solution might be to just run ps. If there are gzip and/or scrypt processes hogging CPU, then this probably just means the backup is proceeding very slowly.
Just to confirm, you already tried creating the backup without compression, right? (That's always the first thing I try when there's a problem backing up VMs containing large amounts of incompressible data.)
I have now, and got the same result
One possibility would be to attach a Python debugger to qubesd, but a simpler solution might be to just run ps. If there are gzip and/or scrypt processes hogging CPU, then this probably just means the backup is proceeding very slowly.
There are not. There are no "gzip" or "tar" strings (other than strings containing "start" or "target") in the output of ps aux while it is locked up.
Nothing related to backup seems to be using the cpu when it's locked up. I'm guessing that the "Finished sending thread" log message means the tar/gzip process has ended?
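For reference, the process check suggested earlier can be done in one line (a sketch; the pattern lists the process names the qubesd debug log shows the pipeline spawning):

```shell
# Show any backup-pipeline processes together with their CPU usage;
# the final grep -v removes the grep command itself from the listing.
ps aux | grep -E 'tarwriter|gzip|scrypt' | grep -v grep
```

An empty result while the backup is "running" is consistent with the pipeline having already exited, as the "Finished sending thread" message suggests.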
Looks like I found the root cause!
Trying to qvm-copy files over to the new VM generated the error: file-agent: Fatal error: Copying file Error reading (error type: Input/output error)
with dmesg in the VM showing: blk_update_request: I/O error, dev xvdb, sector 10452344 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
and dom0 dmesg showing: blk_update_request: critical medium error, dev nvme0n1
implying the NVMe drive is bad.
So we should be able to reduce this to some UX issues like:
Actually, instead of just changing the documentation for qubes-verify to make it clear that it doesn't actually verify the backup was written, could we make qubes-verify actually verify the hash of the backup?
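This is not how qvm-backup-restore's verification works today; as a sketch of the idea, a detached checksum written at backup time would catch a truncated archive before anyone trusts it (backup.bin and backup.bin.sha256 are hypothetical file names):

```shell
# At backup time: record a checksum alongside the backup file.
sha256sum backup.bin > backup.bin.sha256

# At verify time: recompute and compare. A truncated or corrupted
# file fails the check instead of silently passing.
sha256sum --check backup.bin.sha256
```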
@ddevz Thanks for tracking this down! An I/O error can cause all sorts of problems, but a hang should not be one of them unless something else goes wrong. This may well happen, as most software is not tested for its handling of I/O errors. Nevertheless, backup verification really should verify that the backup was actually written, as restoring a backup is time-consuming and an untested backup provides a false sense of security.
Maybe the I/O error even causes an exception, but something in the backup system swallows it like in https://github.com/QubesOS/qubes-issues/issues/7411#issuecomment-1086780297:
The unintentional part is that when the backup destination is a VM (instead of dom0) the process will indeed just hang for some reason, instead of aborting with the message "Backup error: file pool cannot export running volumes" as it should.
Thanks for tracking this down!
Sure thing.
Another thought: does Qubes have a way to notify the user of S.M.A.R.T./disk errors?
There is a package, smart-notifier, that might be able to do it, but it says it is "only for gtk", and I believe Qubes uses Xfce.
Of course, that would mean adding yet another package to dom0 whose security you would have to worry about.
Another thought. Does qubes have a way to notify the user of S.M.A.R.T. /disk errors?
There is a package "smart-notifier" that might be able to do it, but it says "only for gtk", and I believe qubes is xfce.
Just use smartctl, e.g. with conky, or hack together some short script with smartctl and notify-send yourself.
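A minimal version of such a script might look like this (a sketch, not an existing Qubes tool; /dev/nvme0n1 and the notification wording are assumptions, and smartctl -H is smartmontools' overall-health query):

```shell
#!/bin/sh
# Sketch: query the drive's SMART self-assessment and raise a desktop
# notification when it reports anything other than PASSED.
DEV=/dev/nvme0n1   # example device; adjust for your hardware
health=$(smartctl -H "$DEV")
case "$health" in
  *PASSED*) ;;  # drive reports healthy; stay quiet
  *) notify-send -u critical "SMART warning" "$DEV failed its health check" ;;
esac
```

It could be run periodically from a cron job or systemd timer in dom0.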
I'm not sure whether I suffered from this particular bug when upgrading from Qubes 4.0 to 4.1.1 (I had already lost 200+ GiB of data, including over 2 years' worth of work on some projects; I found out too late, when the fresh install had already been done over the old one), but it's closely related:
I made a +900GiB backup on a 3.5" SATA hard drive attached to a USB 3.0 adapter using the sys-usb VM and mounted on a disposable VM. I knew it would take several hours so I unchecked the compress option to save CPU. When I checked next day the Qubes backup tool showed a success message, but I noticed the hard drive motor was stopped (that hard drive never stops its motor when it's connected to an internal SATA port). The backup file was slightly under 800GiB, but at that moment it seemed reasonable because, despite not being compressed, actual written sparse zeroes on some VMs could have been left out. I made the risky decision of not verifying the backup's integrity, as it would have required a similar amount of hours, and I was overly confident in my computer forensic skills in the event of damaged sectors.
At the restore stage an I/O error popped up, and half of my VMs showed 0 bytes of disk usage in the Qube Manager. When doing an emergency recovery I found out that all of those 0-byte VMs had an "Unexpected EOF" error in all of their chunks when decrypting them with scrypt. One VM's chunks were readable up to the 490th.
My (unproven) theory is that at some point either the disposable VM the hard drive was mounted on, or sys-usb, told the drive to sleep (thus stopping the motor), and that caused an I/O error when trying to write past that partial VM's 490th chunk.
In order to prevent further data losses, I would suggest comparing the expected uncompressed size of the logical volumes to the actual written bytes before compressing their images, e.g. by piping dd or wc --bytes to tar or gzip's input.
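The check suggested above might look like this in dom0 (a sketch; vm-example-private is a hypothetical volume name, and blockdev --getsize64 reports the device's advertised capacity in bytes):

```shell
VOL=/dev/qubes_dom0/vm-example-private

# Expected size: the block device's advertised capacity in bytes.
expected=$(blockdev --getsize64 "$VOL")

# Actual size: bytes that can really be read end to end. A bad sector or
# premature EOF makes this come up short (or makes dd fail outright).
actual=$(dd if="$VOL" bs=4M status=none | wc --bytes)

[ "$expected" -eq "$actual" ] || echo "short read: $actual of $expected bytes" >&2
```

This only proves the source volume is fully readable; it does not verify what actually landed on the backup medium.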
@AlbertGoma please file a new issue for this.
@DemiMarie Done: #7797
I can confirm this issue on Qubes R4.1 with a USB-VM setup.
While doing a backup, I had hardware issues: the hard disk became too hot.
The VM with the external USB drive mounted (a dom-backup app qube) showed in its journal:
blk_update_request: I/O error, dev xvdi, sector 52408064 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52408320 op 0x1:(WRITE) flags 0x0 phys_seg 29 prio class 0
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6546752)
blk_update_request: I/O error, dev xvdi, sector 52408552 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52408808 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52409064 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52409320 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52409576 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52409832 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52410088 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52410344 op 0x1:(WRITE) flags 0x0 phys_seg 32 prio class 0
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6547229)
Setting capacity to 0
xvdi: detected capacity change from 976707617 to 0
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6547712)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6548735)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6549423)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6550528)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6551360)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6552576)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6553491)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6493184)
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0) in ext4_reserve_inode_write:5779: Journal has aborted
EXT4-fs error (device dm-0) in ext4_reserve_inode_write:5779: Journal has aborted
EXT4-fs error (device dm-0): ext4_convert_unwritten_extents:4826: inode #11: comm kworker/u4:2: mark_inode_dirty error
EXT4-fs error (device dm-0): ext4_dirty_inode:5983: inode #11: comm dd: mark_inode_dirty error
EXT4-fs error (device dm-0) in ext4_convert_unwritten_io_end_vec:4865: Journal has aborted
EXT4-fs error (device dm-0) in ext4_dirty_inode:5984: Journal has aborted
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs error (device dm-0): ext4_journal_check_start:83: comm dd: Detected aborted journal
Buffer I/O error on dev dm-0, logical block 60850176, lost sync page write
JBD2: Error -5 detected when updating journal superblock for dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_check_start:83: comm kworker/u4:2: Detected aborted journal
Buffer I/O error on dev dm-0, logical block 0, lost sync page write
EXT4-fs (dm-0): previous I/O error to superblock detected
Buffer I/O error on dev dm-0, logical block 0, lost sync page write
EXT4-fs (dm-0): I/O error while writing superblock
EXT4-fs (dm-0): Remounting filesystem read-only
EXT4-fs (dm-0): Remounting filesystem read-only
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
Buffer I/O error on device dm-0, logical block 6546973
Buffer I/O error on device dm-0, logical block 6546974
Buffer I/O error on device dm-0, logical block 6546975
Buffer I/O error on device dm-0, logical block 6546976
Buffer I/O error on device dm-0, logical block 6546977
Buffer I/O error on device dm-0, logical block 6546978
Buffer I/O error on device dm-0, logical block 6546979
Buffer I/O error on device dm-0, logical block 6546980
Buffer I/O error on device dm-0, logical block 6546981
Buffer I/O error on device dm-0, logical block 6546982
Buffer I/O error on dev dm-0, logical block 0, lost sync page write
Buffer I/O error on dev dm-0, logical block 6291465, lost async page write
Meanwhile, qvm-backup --verbose in dom0 was just hanging and did not notice any of that.
Maybe the simplest way to reproduce this would be to run a backup and unplug the external USB backup disk while the backup is running.
Qubes OS release
4.1
Brief summary
Qubes Backup now hangs after writing ~15G:
First time: 15570100 backed up (found with ls -s)
Second time: 15676416 backed up (found with ls -s)
Third time: 15676416 backed up (found with ls -s)
This used to work and write 183G of files.
After it hangs, the disk activity stops.
I'm guessing that it's hanging when it gets to a particular VM. Is there a debug log for qubes-backup somewhere to look for errors?
Steps to reproduce
This happens reliably and has happened 3 out of 3 sequential backup attempts.
I see similar symptoms (but presumably with a different cause, because that reporter was not using LVM, while lvdisplay shows I am) in this report: https://github.com/QubesOS/qubes-issues/issues/7411