abbbi / virtnbdbackup

Backup utility for Libvirt / qemu / kvm supporting incremental and differential backups + instant recovery (agentless).
http://libvirtbackup.grinser.de/
GNU General Public License v3.0
338 stars 46 forks source link

Backup Job Fails to Abort When libvirtd Daemon Restarts or Disconnects #205

Closed inode64 closed 2 months ago

inode64 commented 2 months ago

Version used 2.11

Describe the bug When performing a backup, if the libvirtd daemon restarts or disconnects unexpectedly, the backup job fails to complete and throws an error. The process is unable to stop the backup task due to a broken connection, resulting in a "Broken pipe" error. The application does not handle this scenario correctly, leading to an unhandled exception and terminating the process.

Steps to Reproduce:

Expected behavior

The backup process should be able to detect the disconnection of the libvirtd daemon and handle it gracefully by reconnecting and attempting to abort the job. Ideally, the application should attempt to restore the connection to libvirtd and properly abort the backup task, instead of throwing an unhandled exception.

Hypervisor information:

Logfiles: [2024-08-20 01:33:55] INFO lib common - printVersion [main]: Version: 2.11 Arguments: /usr/lib/python-exec/python3.11/virtnbdbackup -d svrdynamics --noprogress -q --compress -l auto -o /srv/backup/svrdynamics/2024-33 [2024-08-20 01:33:55] INFO root virtnbdbackup - main [main]: Backup level: [auto] [2024-08-20 01:33:55] INFO root virtnbdbackup - main [main]: Compression enabled, level [2] [2024-08-20 01:33:56] INFO root virtnbdbackup - main [main]: Libvirt library version: [10003000] [2024-08-20 01:33:56] INFO root virtnbdbackup - main [main]: NBD library version: [1.18.1] [2024-08-20 01:33:56] INFO root check - targetDir [main]: Backup mode auto, target folder is empty: executing full backup. [2024-08-20 01:33:56] INFO root disktype - Optical [main]: Skipping attached [cdrom] device: [sdb]. 2024-08-20 01:33:56] INFO root virtnbdbackup - main [main]: Backup will save [4] attached disks. 2024-08-20 01:33:56] INFO root virtnbdbackup - main [main]: Concurrent backup processes: [4] 2024-08-20 01:33:56] INFO root checkpoint - create [main]: Loading checkpoints from: [/srv/backup/svrdynamics/2024-33/svrdynamics.cpt] 2024-08-20 01:33:56] INFO root checkpoint - redefine [main]: Loading checkpoint list from: [/srv/backup/svrdynamics/2024-33/checkpoints] 2024-08-20 01:33:56] INFO root checkpoint - create [main]: Checkpoint handling. 2024-08-20 01:34:02] INFO root checkpoint - create [main]: Using checkpoint name: [virtnbdbackup.0]. 2024-08-20 01:34:02] INFO root virtnbdbackup - main [main]: Local NBD Endpoint socket: [/var/tmp/virtnbdbackup.2045415] 2024-08-20 01:34:02] INFO root virtnbdbackup - main [main]: Temporary scratch file target directory: [/var/tmp] 2024-08-20 01:34:02] INFO root job - start [main]: Starting backup job. 2024-08-20 01:34:33] INFO fs fs - freeze [main]: Freezed [5] filesystems. 2024-08-20 01:34:34] INFO fs fs - thaw [main]: Thawed [5] filesystems. 2024-08-20 01:34:34] INFO root virtnbdbackup - main [main]: Started backup job with checkpoint, saving information. 2024-08-20 01:34:34] INFO root checkpoint - backup [main]: Saving checkpoint config to: [/srv/backup/svrdynamics/2024-33/checkpoints/virtnbdbackup.0.xml] 2024-08-20 01:34:34] INFO nbd client - connect [vda]: Waiting until NBD server at [nbd+unix:///vda?socket=/var/tmp/virtnbdbackup.2045415] is up. 2024-08-20 01:34:34] INFO nbd client - connect [vdb]: Waiting until NBD server at [nbd+unix:///vdb?socket=/var/tmp/virtnbdbackup.2045415] is up. 2024-08-20 01:34:34] INFO nbd client - connect [vdc]: Waiting until NBD server at [nbd+unix:///vdc?socket=/var/tmp/virtnbdbackup.2045415] is up. 2024-08-20 01:34:34] INFO nbd client - connect [sda]: Waiting until NBD server at [nbd+unix:///sda?socket=/var/tmp/virtnbdbackup.2045415] is up. 2024-08-20 01:34:35] INFO nbd client - connect [vda]: Connection to NBD backend succeeded. 2024-08-20 01:34:35] INFO nbd client - connect [vdb]: Connection to NBD backend succeeded. 2024-08-20 01:34:35] INFO root disk - _getExtentHandler [vda]: Using qemu tools to query extents 2024-08-20 01:34:35] INFO root disk - _getExtentHandler [vdb]: Using qemu tools to query extents 2024-08-20 01:34:35] INFO nbd client - connect [vdc]: Connection to NBD backend succeeded. 2024-08-20 01:34:35] INFO nbd client - connect [sda]: Connection to NBD backend succeeded. 2024-08-20 01:34:35] INFO root disk - _getExtentHandler [vdc]: Using qemu tools to query extents 2024-08-20 01:34:35] INFO root disk - _getExtentHandler [sda]: Using qemu tools to query extents 2024-08-20 01:34:35] INFO root disk - backup [vda]: Got 31 extents to backup. 2024-08-20 01:34:35] INFO root disk - backup [vda]: 53687091200 bytes disk size 2024-08-20 01:34:35] INFO root disk - backup [vda]: 823189504 bytes of data extents to backup 2024-08-20 01:34:35] INFO root target - get [vda]: Write data to target file: [/srv/backup/svrdynamics/2024-33/vda.full.data.partial]. 2024-08-20 01:34:35] INFO root disk - backup [vda]: Creating thin provisioned stream backup image 2024-08-20 01:34:36] INFO root disk - backup [sda]: Got 357 extents to backup. 2024-08-20 01:34:36] INFO root disk - backup [sda]: 128849018880 bytes disk size 2024-08-20 01:34:36] INFO root disk - backup [sda]: 128540803072 bytes of data extents to backup 2024-08-20 01:34:36] INFO root target - get [sda]: Write data to target file: [/srv/backup/svrdynamics/2024-33/sda.full.data.partial]. 2024-08-20 01:34:36] INFO root disk - backup [sda]: Creating thin provisioned stream backup image 2024-08-20 01:34:36] INFO root disk - backup [vdc]: Got 123 extents to backup. 2024-08-20 01:34:36] INFO root disk - backup [vdc]: 107374182400 bytes disk size 2024-08-20 01:34:36] INFO root disk - backup [vdc]: 65519288320 bytes of data extents to backup 2024-08-20 01:34:36] INFO root target - get [vdc]: Write data to target file: [/srv/backup/svrdynamics/2024-33/vdc.full.data.partial]. 2024-08-20 01:34:36] INFO root disk - backup [vdc]: Creating thin provisioned stream backup image 2024-08-20 01:34:37] INFO root disk - backup [vdb]: Got 17 extents to backup. 2024-08-20 01:34:37] INFO root disk - backup [vdb]: 214748364800 bytes disk size 2024-08-20 01:34:37] INFO root disk - backup [vdb]: 214353313792 bytes of data extents to backup 2024-08-20 01:34:37] INFO root target - get [vdb]: Write data to target file: [/srv/backup/svrdynamics/2024-33/vdb.full.data.partial]. 2024-08-20 01:34:37] INFO root disk - backup [vdb]: Creating thin provisioned stream backup image 2024-08-20 01:34:42] INFO root disk - backup [vda]: Backup of disk [vda] finished, file: [/srv/backup/svrdynamics/2024-33/vda.full.data] 2024-08-20 01:34:42] INFO root metadata - backupChecksum [vda]: Checksum for file: [/srv/backup/svrdynamics/2024-33/vda.full.data]:[948353423] 2024-08-20 01:34:42] INFO root metadata - backupChecksum [vda]: Saving checksum to: [/srv/backup/svrdynamics/2024-33/vda.full.data.chksum] 2024-08-20 01:49:32] INFO root disk - backup [vdc]: Backup of disk [vdc] finished, file: [/srv/backup/svrdynamics/2024-33/vdc.full.data] 2024-08-20 01:49:32] INFO root metadata - backupChecksum [vdc]: Checksum for file: [/srv/backup/svrdynamics/2024-33/vdc.full.data]:[1331928550] 2024-08-20 01:49:32] INFO root metadata - backupChecksum [vdc]: Saving checksum to: [/srv/backup/svrdynamics/2024-33/vdc.full.data.chksum] 2024-08-20 02:02:42] INFO root disk - backup [sda]: Backup of disk [sda] finished, file: [/srv/backup/svrdynamics/2024-33/sda.full.data] 2024-08-20 02:02:42] INFO root metadata - backupChecksum [sda]: Checksum for file: [/srv/backup/svrdynamics/2024-33/sda.full.data]:[496616561] 2024-08-20 02:02:46] INFO root metadata - backupChecksum [sda]: Saving checksum to: [/srv/backup/svrdynamics/2024-33/sda.full.data.chksum] 2024-08-20 02:14:24] INFO root disk - backup [vdb]: Backup of disk [vdb] finished, file: [/srv/backup/svrdynamics/2024-33/vdb.full.data] 2024-08-20 02:14:24] INFO root metadata - backupChecksum [vdb]: Checksum for file: [/srv/backup/svrdynamics/2024-33/vdb.full.data]:[3927655927] 2024-08-20 02:14:24] INFO root metadata - backupChecksum [vdb]: Saving checksum to: [/srv/backup/svrdynamics/2024-33/vdb.full.data.chksum] 2024-08-20 02:14:24] INFO root virtnbdbackup - main [main]: Backup jobs finished, stopping backup task. 2024-08-20 02:14:24] WARNING virt client - stopBackup [main]: Failed to stop backup job: [Cannot write data: Broken pipe] 2024-08-20 02:14:24] INFO root metadata - backupConfig [main]: Saving VM config to: [/srv/backup/svrdynamics/2024-33/vmconfig.virtnbdbackup.0.xml] 2024-08-20 02:14:24] INFO root metadata - backupBootConfig [main]: Save additional boot config [loader] to: [/srv/backup/svrdynamics/2024-33/edk2-x86_64-code.fd.virtnbdbackup.0] 2024-08-20 02:14:24] INFO root metadata - backupBootConfig [main]: Save additional boot config [nvram] to: [/srv/backup/svrdynamics/2024-33/svrdynamics_VARS.fd.virtnbdbackup.0] 2024-08-20 02:14:24] INFO root metadata - backupDiskInfo [main]: Saved qcow image config to: [/srv/backup/svrdynamics/2024-33/vda.virtnbdbackup.0.qcow.json] 2024-08-20 02:14:24] INFO root metadata - backupDiskInfo [main]: Saved qcow image config to: [/srv/backup/svrdynamics/2024-33/vdb.virtnbdbackup.0.qcow.json] 2024-08-20 02:14:24] INFO root metadata - backupDiskInfo [main]: Saved qcow image config to: [/srv/backup/svrdynamics/2024-33/vdc.virtnbdbackup.0.qcow.json] 2024-08-20 02:14:24] INFO root metadata - backupDiskInfo [main]: Saved qcow image config to: [/srv/backup/svrdynamics/2024-33/sda.virtnbdbackup.0.qcow.json] Traceback (most recent call last): File "/usr/lib/python-exec/python3.11/virtnbdbackup", line 493, in main() File "/usr/lib/python-exec/python3.11/virtnbdbackup", line 471, in main if domObj.autostart() == 1: ^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/libvirt.py", line 741, in autostart raise libvirtError('virDomainGetAutostart() failed') libvirt.libvirtError: internal error: client socket is closed

abbbi commented 2 months ago

What happens to the backup job anyway if the libvirtd daemon is restarted? It seems the Qemu process still keeping the NDB handle open, otherwise data transfer would fail. Is the backup job still listed in domjobs output after a libvrtd restart? Really a corner case here, i dont know if want to invest time to fix this..

Problem to fix this is that the libvirt connection is initialized at application startup, there is probably no easy way to re-connect without re-designing a large part of the code base. One way to solve this would probably be a second thread which checks for the connection state and re-initializes it if its broken, but then this is still prone to race conditions.

Another one would be to slap an decorator on each function within the virt/client.py which re-initializes a connection if it is broken, or redesign the module at all that for each operation a new connection is opened (which isnt that great,too). But then you would need to handle this gracefully too, what if libvirtd doesnt come up for several minutes or even hours?)

I dont see any options to the libvirt flags upon connection opening that would introduce an automatic reconnect.

abbbi commented 2 months ago
>>> import libvirt
>>> foo = libvirt.open("qemu:///session")              # restart libvirt daemon at this point
>>> foo.isAlive()
1                                                      # wouldnt expect that
>>> foo.getSysinfo()
libvirt: XML-RPC error : Cannot write data: Broken pipe
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/libvirt.py", line 4742, in getSysinfo
    raise libvirtError('virConnectGetSysinfo() failed')
libvirt.libvirtError: Cannot write data: Broken pipe
>>> foo.isAlive()
0                                                   # connection drops alive once an error is happening.
abbbi commented 2 months ago

registerCloseCallback looks interesting.

abbbi commented 2 months ago
>>> def callback(foo,bar,baz):
...     print("help..")
... 
>>> 
>>> 
>>> foo = libvirt.open("qemu:///session")
>>> foo.registerCloseCallback(callback, None)
0
>>> foo.isAlive()
1
>>> foo.getSysinfo()
help..
libvirt: XML-RPC error : Cannot write data: Broken pipe
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/libvirt.py", line 4742, in getSysinfo
    raise libvirtError('virConnectGetSysinfo() failed')
libvirt.libvirtError: Cannot write data: Broken pipe
abbbi commented 2 months ago

hi,

can you check branch issue_205 with its commits. in my tests it then attempts to re-connect if connection error handle is called due to issues with libvirt backend:

[2024-08-21 11:55:23] ERROR root virtnbdbackup - connectionError [main]:  Libvirt connection error detected, reconnecting
[2024-08-21 11:55:23] ERROR root virtnbdbackup - connectionError [main]:  Attempting to stop backup task
inode64 commented 2 months ago

Ok, thank you very much, all I can do is try it in a test installation and let you know.

abbbi commented 2 months ago

this one made it through:

https://github.com/abbbi/virtnbdbackup/actions/runs/10487766874/artifacts/1836691428

inode64 commented 2 months ago

It runs fine for me!! Thanks

I have created a pull to include retries #206