Closed: jammerhund closed this issue 6 months ago
hi,
I tried with virtual machines with up to 4 disks, restoring multiple times, and it works. If it is reproducible for you, can you execute with the -vn option and supply the full logfile? It appears the qemu-nbd process that fills the qcow files during restore is started but doesn't accept any connections? Or it fails some time after start...
Hi,
thank you for your response. Yes i can reproduce this problem. Here is the log.
main]: Started [qemu-nbd] process: [processInfo(pid=224816, logFile='/tmp/qemu-nbdep3138ds.log', err='', out='', pidFile='')]
[2024-03-21 16:52:52] INFO nbd client - printVersion [main]: libnbd version: 1.14.2
Anything in that Logfile?
I think this is probably a race condition where the first qemu-nbd process has not yet finished writing. Is the restore target storage slow?
Hi,
the logfile is empty. The restore source is a SAS-HDD RAID5 and the target is an NVMe RAID1. With dd and oflag=direct I get 3.5 GB/s. A restore from the HDD to the same HDD ends up in the same error. Can I debug/trace the start and termination of the qemu-nbd processes?
so what happens during restore is, for each disk:
1) a qemu-nbd process is started which provides an NBD server
2) data is written to the NBD server
3) as soon as the NBD server has received all data, it ends itself.
These steps are repeated for the next disk, using the same socket file.
I guess that maybe between steps 2 and 3 there may still be some action to be taken on the NBD server that takes a little longer and blocks the socket file for the next disk. But it's strange, because after each disk is processed I call the disconnect() function for the NBD server process, which should make the qemu-nbd process end.
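One way to rule out this suspected overlap would be to wait for the previous qemu-nbd process to exit before starting the next one on the same socket file. A minimal stdlib-only sketch, assuming the previous process's PID is known (for qemu-nbd with --fork the daemonized child is reparented, so it is fully gone once the PID probe fails):

```python
import errno
import os
import time

def wait_for_exit(pid: int, timeout: float = 20.0, interval: float = 0.5) -> bool:
    """Poll until the process with the given PID has exited.

    os.kill(pid, 0) sends no signal; it only checks whether the
    process still exists. Returns True if the process exited within
    the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)
        except OSError as exc:
            if exc.errno == errno.ESRCH:  # no such process: it has exited
                return True
            raise
        time.sleep(interval)
    return False
```

The PID to poll could be taken from a pid file written by the server process; the function and parameter names here are illustrative, not the actual virtnbdbackup code.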
I've just pushed this commit:
https://github.com/abbbi/virtnbdbackup/commit/1a1bac613643f9f990ed95dab96c9c28cd2764d4
Maybe calling flush() on the handle helps. It's hard to fix if I can't reproduce it. Can you retry with current git master and debug log level again?
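The idea behind the commit can be sketched as: flush the NBD handle before shutting it down, so all pending writes reach the qcow2 file before the socket is reused for the next disk. This is illustrative only; `handle` stands for a libnbd-style handle, and the exact method names are an assumption of this sketch:

```python
def close_disk(handle) -> None:
    """Finish one disk's restore on an NBD handle before the socket is reused.

    `handle` is any object exposing flush() and shutdown(); libnbd's
    Python bindings offer calls of this kind on their handles (an
    assumption here, matching the flush-before-disconnect idea).
    """
    handle.flush()     # force pending writes out to the backing qcow2
    handle.shutdown()  # cleanly disconnect so the qemu-nbd process can exit
```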
What's strange is that this behavior shouldn't happen, because qemu-nbd would fail if another process were still using the socket:
qemu-nbd --discard=unmap --format=qcow2 -x sda /tmp/test.qcow2 -k /tmp/test --fork
qemu-nbd: Failed to blk_new_open '/tmp/test.qcow2': Failed to get "write" lock
Is another process using the image [/tmp/test.qcow2]?
and the function which starts the qemu-nbd process for the second disk would catch this error. In the log it's clear that the process starts with exit code 0:
[2024-03-21 16:52:52] DEBUG libvirtnbdbackup.qemu.command command - run [main]: CMD: qemu-nbd --discard=unmap --format=qcow2 -x sdb /data/restored/foobar/foobar_d.qcow2 -k /var/tmp/virtnbdbackup.224619 --fork
[2024-03-21 16:52:52] DEBUG libvirtnbdbackup.qemu.command command - run [main]: Return code: 0
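The check described above can be sketched as a small wrapper: since the command is started with --fork, the foreground process should exit non-zero and print to stderr when the server cannot start (as in the write-lock example), so inspecting the return code surfaces that case. A hedged sketch, not the actual virtnbdbackup function:

```python
import subprocess

def start_nbd_server(cmd: list) -> None:
    """Run a qemu-nbd-style command that forks into the background.

    If the foreground process exits non-zero (e.g. on a write-lock
    conflict), raise with whatever it printed to stderr.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"NBD server failed to start: {result.stderr.strip()}")
```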
Is AppArmor active on the system? Maybe this is another AppArmor shenanigan...
One way to diagnose would be to check, during the 20-second timeout while it retries:
INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 0
whether the qemu-nbd process started as reported in the logfile is still running, and whether it is reachable, maybe by using nbdinfo and trying to connect. Latest master will now report the PID correctly and also passes the --pid-file argument to the qemu-nbd process that is started.
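The retry loop behind the "Waiting for NBD Server, Retry: 0" message can be sketched as polling for the unix socket file before connecting. The function name, the 20-retry count, and the delay are assumptions inferred from the log messages above, not the actual implementation:

```python
import logging
import os
import time

def wait_for_server(socket_path: str, retries: int = 20, delay: float = 1.0) -> bool:
    """Wait until the NBD server's unix socket file appears.

    Returns True once the socket path exists, False after exhausting
    the retries.
    """
    for attempt in range(retries):
        if os.path.exists(socket_path):
            return True
        logging.info("Waiting for NBD Server, Retry: %s", attempt)
        time.sleep(delay)
    return os.path.exists(socket_path)
```

While this loop is retrying, one could also probe the server by hand, e.g. pointing nbdinfo at the same socket file, to see whether the forked qemu-nbd process answers at all.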
Thank you very much! With the latest master, the restore of both disks succeeds.
thanks for confirming!
Version used
Provide output of virtnbdbackup -V:
2.1

Describe the bug
The restore of the first disk ends successfully. The restore of the second disk ends up with an NBD error.

Expected behavior
Successful restore of both disks (qcow2 files).
Hypervisor information:
Logfiles:
Workaround: Restoring in two steps with the --disk parameter works fine.