Open ddevz opened 2 years ago
Does backup work if all of the VMs you are trying to back up are stopped? That should not be necessary, but it might point at the cause of the problem.
I'll try it
Does backup work if all of the VMs you are trying to back up are stopped? That should not be necessary, but it might point at the cause of the problem.
Tried it. No change.
Is there a log of what has been backed up? I tried doing a "verify" from the restore tool to see what got backed up and what did not, in order to get a hint at which VM was causing it, but that led to a new, larger concern:
New Even Larger Concern: Doing a verify says everything backed up fine, even images which couldn't possibly compress down to 15 Gigs. This could be a problem for someone who is using "verify" from the restore tool with the intention of making sure their backups are there, as they will be surprised when they actually try to restore from them.
Raising priority as it causes data loss.
My understanding is that the backup verification tool checks only for errors during the restore process; it does not check for missing data. @marmarek, is this accurate?
The verification tool should check for truncated VM images at least.
First time: 15570100 backed up (found with ls -s)
Second time: 15676416 backed up (found with ls -s)
Third time: 15676416 backed up (found with ls -s)
This used to work and write 183G of files.
After it hangs, the disk activity stops.
Did you recently significantly increase some VM's storage capacity? Backing up very sparse data (e.g. a VM with hundreds of gigabytes of free space) from LVM will cause the backup system to spend most of its time seemingly doing nothing: https://forum.qubes-os.org/t/solved-backup-qubes-tarwriter-extremly-slow/7456
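The sparse-data effect described in that forum thread is easy to demonstrate with a throwaway file (a sketch; sparse.img is just an example name): the apparent size and the allocated size diverge wildly, yet a byte-by-byte reader still has to scan the full apparent size.

```shell
# Create a file with a 10 GiB apparent size but (almost) no allocated blocks.
truncate -s 10G sparse.img

stat -c 'apparent: %s bytes'   sparse.img   # reports 10737418240
stat -c 'allocated: %b blocks' sparse.img   # reports a near-zero block count

rm sparse.img
```

This is the same divergence ls -lh (apparent) versus ls -s (allocated) shows on the backup files.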
That would explain it!!!! This is definitely a (UX, at least) bug.
Did you recently significantly increase some VM's storage capacity? Backing up very sparse data (e.g. a VM with hundreds of gigabytes of free space) from LVM will cause the backup system to spend most of its time seemingly doing nothing: https://forum.qubes-os.org/t/solved-backup-qubes-tarwriter-extremly-slow/7456
Yes, I did. After your post, I removed everything that I believe had sparse filesystems, and it still only got to 15665920. Running top shows no resource consumption, and ps does not show any tarwriter process in the listing.
Is there any way to figure out what VM it's hanging on? I'm really hoping for a log somewhere.
I found qvm-backup, so to try to debug it, instead of using the GUI, I did:
cp /etc/qubes/backup/qubes-manager-backup.conf backup-service-vms.conf
Then removed a bunch of vms from the list, then:
qvm-backup --verbose --profile ./backup-service-vms.conf
and immediately got:
Backup preparation error: Got empty response from qubesd. See journalctl in dom0 for details.
Any ideas? Obviously this is not the cause of the original problem, as it happens instantly, whereas the problem gets 15G into the backup, but this backup preparation error is preventing me from figuring out what's going on.
journalctl | grep -i 'backup'
gives:
dom0 qubesd[3291]: permission denied for call b'admin.backup.Info'+b'./backup-service-vms.conf' (b'dom0' → b'dom0') with payload of 0 bytes
The backup profile must be passed as the basename of a .conf file under /etc/qubes/backup/, e.g.:
cp /etc/qubes/backup/qubes-manager-backup.conf /etc/qubes/backup/foo.conf
qvm-backup --verbose --profile foo
But qvm-backup probably won't show anything relevant even with --verbose, because everything interesting happens within the qubesd daemon. You'd have to enable debug logging for this daemon and then do another backup run:
Run sudo systemctl edit --runtime --full qubesd, add the --debug argument after /usr/bin/qubesd, save the file, and quit. Then restart the daemon with systemctl restart qubesd and keep journalctl -ft qubesd -g backup running in a terminal window.
Results:
Most backups seem to look like this:
Jun 21 11:23:05 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:657] qubes.backup: python3 -m qubes.tarwriter --use-compress-program=gzip --override-name=vm19/private.img /dev/qubes_dom0/vm-sys-backup-private
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._split_and_send:593] qubes.backup: handle_streams returned: None
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._split_and_send:601] qubes.backup: scrypt return code: 0
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:612] qubes.backup: Backing up <qubes.backup.Backup.FileToBackup object at 0x730d73c6d1c0>
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:617] qubes.backup: Using temporary location: /tmp/tmpgsgz55e4/vm19/firewall.xml
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:657] qubes.backup: tar -Pc --sparse -C /var/lib/qubes/appvms/sys-backup --dereference --use-compress-program=gzip --xform=s:^firewall.xml:vm19/\0: firewall.xml
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup.run:141] qubes.backup: Sending file vm19/private.img.000.enc
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup.run:159] qubes.backup: Removing file vm19/private.img.000.enc
Jun 21 11:23:08 dom0 qubesd[800404]: [MainProcess backup._split_and_send:593] qubes.backup: handle_streams returned: None
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup._split_and_send:601] qubes.backup: scrypt return code: 0
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:612] qubes.backup: Backing up <qubes.backup.Backup.FileToBackup object at 0x730d73c6d3a0>
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:617] qubes.backup: Using temporary location: /tmp/tmpgsgz55e4/vm20/private.img
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup.run:141] qubes.backup: Sending file vm19/firewall.xml.000.enc
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup.run:159] qubes.backup: Removing file vm19/firewall.xml.000.enc
However, the final backup when the hang happened, looked like this:
Jun 21 11:23:12 dom0 qubesd[800404]: [MainProcess backup._wrap_and_send_files:657] qubes.backup: python3 -m qubes.tarwriter --use-compress-program=gzip --override-name=vm20/private.img /dev/qubes_dom0/vm-qubename-private
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup._split_and_send:593] qubes.backup: handle_streams returned: None
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup._split_and_send:601] qubes.backup: scrypt return code: 0
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup.run:141] qubes.backup: Sending file vm20/private.img.000.enc
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup.run:159] qubes.backup: Removing file vm20/private.img.000.enc
Jun 21 11:23:22 dom0 qubesd[800404]: [MainProcess backup.run:162] qubes.backup: Finished sending thread
And then just hangs there.
Now, to be fair, the system it hangs on has 28G of incompressible data (and 7G of free space) on the private volume.
Note that the "handle_streams returned" line showed up before the "Sending file" and "Removing file" lines for vm20/private.img, as opposed to vm19 and some others I checked.
Also, I see that it split the files for some of the larger templates, so with 28G of data I would expect it to be trying to split this one too.
Also, I left the string "vm-sys-backup-private" in to draw attention to the fact that the backup system backed up the backup qube (the qube the backup data passes through on its way to being saved) right before it started on vm20 (in case that is relevant).
I removed that one VM from the backup list and the backup worked. I can probably fix that VM by creating a new appVM, qvm-copying the files over to it, and erasing the original appVM. Assuming that process works, is there any reason to keep debugging the original appVM to uncover what the actual problem was?
Interestingly, that appVM is one I had backed up before, and to the best of my knowledge haven't touched since the last backup.
Now, to be fair, the system it hangs on has 28G of incompressible data
Just to confirm, you already tried creating the backup without compression, right? (That's always the first thing I try when there's a problem backing up VMs containing large amounts of incompressible data.)
@andrewdavidwong:
My understanding is that the backup verification tool checks only for errors during the restore process; it does not check for missing data. @marmarek, is this accurate?
@DemiMarie:
The verification tool should check for truncated VM images at least.
Ah, here's the issue I was thinking of:
is there any reason to debug the original appVM to uncover what the actual problem was?
If you have enough disk space you could keep that VM around, in case someone comes up with an idea how to find the bug.
One possibility would be to attach a Python debugger to qubesd, but a simpler solution might be to just run ps. If there are gzip and/or scrypt processes hogging CPU, then this probably just means the backup is proceeding very slowly.
Just to confirm, you already tried creating the backup without compression, right? (That's always the first thing I try when there's a problem backing up VMs containing large amounts of incompressible data.)
I have now, and got the same result
One possibility would be to attach a Python debugger to qubesd, but a simpler solution might be to just run ps. If there are gzip and/or scrypt processes hogging CPU, then this probably just means the backup is proceeding very slowly.
There are not. There are no "gzip" or "tar" strings (other than strings containing "start" or "target") in the output of ps aux while it is locked up.
Nothing related to backup seems to be using the cpu when it's locked up. I'm guessing that the "Finished sending thread" log message means the tar/gzip process has ended?
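For reference, the process check suggested earlier can be done in one line (a sketch; the pattern lists the process names the qubesd debug log shows the pipeline spawning):

```shell
# Show any backup-pipeline processes together with their CPU usage;
# the final grep -v removes the grep command itself from the listing.
ps aux | grep -E 'tarwriter|gzip|scrypt' | grep -v grep
```

An empty result while the backup is "running" is consistent with the pipeline having already exited, as the "Finished sending thread" message suggests.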
Looks like I found the root cause!
Trying to qvm-copy files over to the new VM generated the error: file-agent: Fatal error: Copying file Error reading (error type: Input/output error)
with dmesg in the VM showing: blk_update_request: I/O error, dev xvdb, sector 10452344 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
and dom0 dmesg showing: blk_update_request: critical medium error, dev nvme0n1
implying the NVMe drive is bad.
So we should be able to reduce this to some UX issues like:
Actually, instead of just changing the documentation for qubes-verify to make it clear that it doesn't actually verify the backup was written, could we make qubes-verify actually verify the hash of the backup?
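This is not how qvm-backup-restore's verification works today; as a sketch of the idea, a detached checksum written at backup time would catch a truncated archive before anyone trusts it (backup.bin and backup.bin.sha256 are hypothetical file names):

```shell
# At backup time: record a checksum alongside the backup file.
sha256sum backup.bin > backup.bin.sha256

# At verify time: recompute and compare. A truncated or corrupted
# file fails the check instead of silently passing.
sha256sum --check backup.bin.sha256
```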
@ddevz Thanks for tracking this down! An I/O error can cause all sorts of problems, but a hang should not be one of them unless something else goes wrong. This may well happen, as most software is not tested for its handling of I/O errors. Nevertheless, backup verification really should verify that the backup was actually written, as restoring a backup is time-consuming and an untested backup provides a false sense of security.
Maybe the I/O error even causes an exception, but something in the backup system swallows it like in https://github.com/QubesOS/qubes-issues/issues/7411#issuecomment-1086780297:
The unintentional part is that when the backup destination is a VM (instead of dom0) the process will indeed just hang for some reason, instead of aborting with the message "Backup error: file pool cannot export running volumes" as it should.
Thanks for tracking this down!
Sure thing.
Another thought: does Qubes have a way to notify the user of S.M.A.R.T./disk errors?
There is a package, smart-notifier, that might be able to do it, but it says it is "only for gtk", and I believe Qubes uses Xfce.
Of course, that would mean adding yet another package to dom0 whose security you would have to worry about.
Another thought. Does qubes have a way to notify the user of S.M.A.R.T. /disk errors?
There is a package "smart-notifier" that might be able to do it, but it says "only for gtk", and I believe qubes is xfce.
Just use smartctl, e.g. with conky, or hack together some short script with smartctl and notify-send yourself.
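A minimal version of such a script might look like this (a sketch, not an existing Qubes tool; /dev/nvme0n1 and the notification wording are assumptions, and smartctl -H is smartmontools' overall-health query):

```shell
#!/bin/sh
# Sketch: query the drive's SMART self-assessment and raise a desktop
# notification when it reports anything other than PASSED.
DEV=/dev/nvme0n1   # example device; adjust for your hardware
health=$(smartctl -H "$DEV")
case "$health" in
  *PASSED*) ;;  # drive reports healthy; stay quiet
  *) notify-send -u critical "SMART warning" "$DEV failed its health check" ;;
esac
```

It could be run periodically from a cron job or systemd timer in dom0.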
I'm not sure whether I suffered from this particular bug when upgrading from Qubes 4.0 to 4.1.1 (I had already lost 200+ GiB of data, including over 2 years' worth of work on some projects; I found out too late, when the fresh install had already been done over the old one), but it's closely related:
I made a +900GiB backup on a 3.5" SATA hard drive attached to a USB 3.0 adapter using the sys-usb VM and mounted on a disposable VM. I knew it would take several hours so I unchecked the compress option to save CPU. When I checked next day the Qubes backup tool showed a success message, but I noticed the hard drive motor was stopped (that hard drive never stops its motor when it's connected to an internal SATA port). The backup file was slightly under 800GiB, but at that moment it seemed reasonable because, despite not being compressed, actual written sparse zeroes on some VMs could have been left out. I made the risky decision of not verifying the backup's integrity, as it would have required a similar amount of hours, and I was overly confident in my computer forensic skills in the event of damaged sectors.
At the restore stage an I/O error popped up, and half of my VMs showed 0 bytes of disk usage in the Qube Manager. When doing an emergency recovery I found out that all of those 0-byte VMs had an "Unexpected EOF" error in all of their chunks when decrypting them with scrypt. One VM's chunks were readable up to the 490th.
My (unproven) theory is that at some point either the disposable VM the hard drive was mounted on, or sys-usb, told the drive to sleep (thus stopping the motor), and that caused an I/O error when trying to write past that partial VM's 490th chunk.
In order to prevent further data losses, I would suggest comparing the expected uncompressed size of the logical volumes to the actual written bytes before compressing their images, e.g. by piping dd or wc --bytes to tar or gzip's input.
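The check suggested above might look like this in dom0 (a sketch; vm-example-private is a hypothetical volume name, and blockdev --getsize64 reports the device's advertised capacity in bytes):

```shell
VOL=/dev/qubes_dom0/vm-example-private

# Expected size: the block device's advertised capacity in bytes.
expected=$(blockdev --getsize64 "$VOL")

# Actual size: bytes that can really be read end to end. A bad sector or
# premature EOF makes this come up short (or makes dd fail outright).
actual=$(dd if="$VOL" bs=4M status=none | wc --bytes)

[ "$expected" -eq "$actual" ] || echo "short read: $actual of $expected bytes" >&2
```

This only proves the source volume is fully readable; it does not verify what actually landed on the backup medium.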
@AlbertGoma please file a new issue for this.
@DemiMarie Done: #7797
I can confirm this issue on Qubes R4.1 with a USB-VM setup.
While doing a backup, I had hardware issues: the hard disk became too hot.
The VM with the external USB drive mounted (a dom-backup app qube) showed in its journal:
blk_update_request: I/O error, dev xvdi, sector 52408064 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52408320 op 0x1:(WRITE) flags 0x0 phys_seg 29 prio class 0
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6546752)
blk_update_request: I/O error, dev xvdi, sector 52408552 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52408808 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52409064 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52409320 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52409576 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52409832 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52410088 op 0x1:(WRITE) flags 0x4000 phys_seg 32 prio class 0
blk_update_request: I/O error, dev xvdi, sector 52410344 op 0x1:(WRITE) flags 0x0 phys_seg 32 prio class 0
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6547229)
Setting capacity to 0
xvdi: detected capacity change from 976707617 to 0
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6547712)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6548735)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6549423)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6550528)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6551360)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6552576)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6553491)
EXT4-fs warning (device dm-0): ext4_end_bio:344: I/O error 10 writing to inode 11 starting block 6493184)
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0) in ext4_reserve_inode_write:5779: Journal has aborted
EXT4-fs error (device dm-0) in ext4_reserve_inode_write:5779: Journal has aborted
EXT4-fs error (device dm-0): ext4_convert_unwritten_extents:4826: inode #11: comm kworker/u4:2: mark_inode_dirty error
EXT4-fs error (device dm-0): ext4_dirty_inode:5983: inode #11: comm dd: mark_inode_dirty error
EXT4-fs error (device dm-0) in ext4_convert_unwritten_io_end_vec:4865: Journal has aborted
EXT4-fs error (device dm-0) in ext4_dirty_inode:5984: Journal has aborted
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs error (device dm-0): ext4_journal_check_start:83: comm dd: Detected aborted journal
Buffer I/O error on dev dm-0, logical block 60850176, lost sync page write
JBD2: Error -5 detected when updating journal superblock for dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_check_start:83: comm kworker/u4:2: Detected aborted journal
Buffer I/O error on dev dm-0, logical block 0, lost sync page write
EXT4-fs (dm-0): previous I/O error to superblock detected
Buffer I/O error on dev dm-0, logical block 0, lost sync page write
EXT4-fs (dm-0): I/O error while writing superblock
EXT4-fs (dm-0): Remounting filesystem read-only
EXT4-fs (dm-0): Remounting filesystem read-only
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
EXT4-fs (dm-0): failed to convert unwritten extents to written extents -- potential data loss! (inode 11, error -30)
Buffer I/O error on device dm-0, logical block 6546973
Buffer I/O error on device dm-0, logical block 6546974
Buffer I/O error on device dm-0, logical block 6546975
Buffer I/O error on device dm-0, logical block 6546976
Buffer I/O error on device dm-0, logical block 6546977
Buffer I/O error on device dm-0, logical block 6546978
Buffer I/O error on device dm-0, logical block 6546979
Buffer I/O error on device dm-0, logical block 6546980
Buffer I/O error on device dm-0, logical block 6546981
Buffer I/O error on device dm-0, logical block 6546982
Buffer I/O error on dev dm-0, logical block 0, lost sync page write
Buffer I/O error on dev dm-0, logical block 6291465, lost async page write
Meanwhile, qvm-backup --verbose in dom0 was just hanging and did not notice any of that.
Maybe the simplest way to reproduce this would be to run a backup and unplug the external USB backup disk while the backup is running.
Qubes OS release
4.1
Brief summary
Qubes Backup now hangs after writing ~15G:
First time: 15570100 backed up (found with ls -s)
Second time: 15676416 backed up (found with ls -s)
Third time: 15676416 backed up (found with ls -s)
This used to work and write 183G of files.
After it hangs, the disk activity stops.
I'm guessing that it's hanging when it gets to a particular VM. Is there a debug log for qubes-backup somewhere to look for errors?
Steps to reproduce
This happens reliably and has happened 3 out of 3 sequential backup attempts.
I see similar symptoms (but presumably with a different cause, because that reporter was not using LVM, while lvdisplay shows I am) in this report: https://github.com/QubesOS/qubes-issues/issues/7411