abbbi / virtnbdbackup

Backup utility for Libvirt / qemu / kvm supporting incremental and differential backups + instant recovery (agentless).
http://libvirtbackup.grinser.de/
GNU General Public License v3.0

restore failed with more than one disk - NBD error #163

Closed: jammerhund closed this issue 6 months ago

jammerhund commented 6 months ago

Version used (virtnbdbackup -V): 2.1

Describe the bug: The restore of the first disk completes successfully. The restore of the second disk fails with an NBD error.

Expected behavior: Successful restore of both disks (qcow2 files).

Hypervisor information:

Logfiles:

[2024-03-20 21:17:12] INFO root image - create [main]: Create virtual disk [/data/kvm/foobar/foobar_disk2.qcow2] format: [qcow2] size: [10737418240] based on: [virtnbdbackup.1]
[2024-03-20 21:17:12] INFO root image - getConfig [main]: Using QCOW options from backup file: [/data/virtnbdbackup/foobar/sdb.virtnbdbackup.1.qcow.json]
[2024-03-20 21:17:12] INFO root server - setup [main]: Starting local NBD server on socket: [/var/tmp/virtnbdbackup.208875]
[2024-03-20 21:17:12] INFO nbd client - printVersion [main]: libnbd version: 1.14.2
[2024-03-20 21:17:12] INFO root server - setup [main]: Started NBD server, PID: [209128]
[2024-03-20 21:17:12] INFO nbd client - connect [main]: Waiting until NBD server at [nbd+unix:///sdb?socket=/var/tmp/virtnbdbackup.208875] is up.
[2024-03-20 21:17:13] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 0
[2024-03-20 21:17:14] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 1
[2024-03-20 21:17:15] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 2
[2024-03-20 21:17:16] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 3
[2024-03-20 21:17:17] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 4
[2024-03-20 21:17:18] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 5
[2024-03-20 21:17:19] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 6
[2024-03-20 21:17:20] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 7
[2024-03-20 21:17:21] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 8
[2024-03-20 21:17:22] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 9
[2024-03-20 21:17:23] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 10
[2024-03-20 21:17:24] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 11
[2024-03-20 21:17:25] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 12
[2024-03-20 21:17:26] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 13
[2024-03-20 21:17:27] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 14
[2024-03-20 21:17:28] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 15
[2024-03-20 21:17:29] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 16
[2024-03-20 21:17:30] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 17
[2024-03-20 21:17:31] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 18
[2024-03-20 21:17:32] INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 19
Traceback (most recent call last):
  File "/usr/bin/virtnbdrestore", line 267, in <module>
    main()
  File "/usr/bin/virtnbdrestore", line 253, in main
    restConfig = disk.restore(args, ConfigFile, virtClient)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirtnbdbackup/restore/disk.py", line 98, in restore
    connection = server.start(args, meta["diskName"], targetFile, virtClient)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirtnbdbackup/restore/server.py", line 59, in start
    return setup(args, diskName, targetFile, virtClient)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirtnbdbackup/restore/server.py", line 53, in setup
    return nbdClient.connect()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirtnbdbackup/nbdcli/client.py", line 91, in connect
    raise exceptions.NbdConnectionTimeout(
libvirtnbdbackup.nbdcli.exceptions.NbdConnectionTimeout: Timeout during connection to NBD server backend.

Workaround: Restoring in two steps with the --disk parameter works fine.
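For reference, a minimal sketch of that two-step workaround (the --disk parameter is the one mentioned above; the -i input / -o output options are assumed to be virtnbdrestore's usual arguments, and the paths are taken from the logfiles in this issue):

# restore only the first disk
virtnbdrestore -i /data/virtnbdbackup/foobar -o /data/restored/foobar --disk sda

# then restore the second disk in a separate run
virtnbdrestore -i /data/virtnbdbackup/foobar -o /data/restored/foobar --disk sdb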

abbbi commented 6 months ago

hi,

I tried with virtual machines with up to 4 disks, restoring multiple times, and it works. If it is reproducible for you, can you execute with the -vn option and supply the full logfile? It appears the qemu-nbd process that fills the qcow files during restore is started but doesn't accept any connections? Or it fails some time after starting...

jammerhund commented 6 months ago

Hi,

thank you for your response. Yes, I can reproduce this problem. Here is the log.

virtnbdrestore_01.log

abbbi commented 6 months ago

main]: Started [qemu-nbd] process: [processInfo(pid=224816, logFile='/tmp/qemu-nbdep3138ds.log', err='', out='', pidFile='')]
[2024-03-21 16:52:52] INFO nbd client - printVersion [main]: libnbd version: 1.14.2

Anything in that Logfile?

I think this is probably a race condition where the first qemu-nbd process is not yet finished writing... is the restore target storage slow?

jammerhund commented 6 months ago

Hi,

the logfile is empty. The restore source is a SAS HDD RAID5 and the target is an NVMe RAID1. With dd and oflag=direct I get 3.5 GB/s. A restore from the HDD to the same HDD ends up with the same error. Can I debug/trace the start and termination of the qemu-nbd processes?
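For reference, a rough sketch of that kind of throughput check, plus one simple way to watch qemu-nbd processes start and terminate during a restore (the test file path is a placeholder):

# rough write-throughput check on the restore target, bypassing the page cache
dd if=/dev/zero of=/data/restored/ddtest.img bs=1M count=4096 oflag=direct

# in a second terminal, watch qemu-nbd processes appear and disappear
watch -n1 'pgrep -af qemu-nbd'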

abbbi commented 6 months ago

So what happens during restore, for each disk, is:

1) a qemu-nbd process is started which provides an NBD server
2) the data is written to the NBD server
3) as soon as the NBD server has received all data, it ends itself

These steps are repeated for the next disk, using the same socket file; a rough sketch of this flow follows below.
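For illustration only, a shell sketch of that per-disk flow (paths and image names are placeholders, and nbdcopy merely stands in for the tool's own libnbd-based writer):

# step 1: start a one-shot NBD server backed by the freshly created target qcow2
qemu-nbd --discard=unmap --format=qcow2 -x sda /data/restored/foobar/disk1.qcow2 -k /var/tmp/restore.sock --fork

# step 2: write the backup data into the export; step 3 then happens implicitly,
# since qemu-nbd (started without --persistent) exits once the client disconnects
nbdcopy /data/backup/disk1.raw 'nbd+unix:///sda?socket=/var/tmp/restore.sock'

# next disk: a new qemu-nbd process reuses the same socket path
qemu-nbd --discard=unmap --format=qcow2 -x sdb /data/restored/foobar/disk2.qcow2 -k /var/tmp/restore.sock --fork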

I guess that maybe between step 2 and 3 there may still be some action to be taken on the NBD server that takes a little longer and blocks the socket file for the next disk... But it's strange, because after each disk is processed I call the disconnect() function for the NBD server process, which should make the qemu-nbd process end.

I've just pushed this commit:

https://github.com/abbbi/virtnbdbackup/commit/1a1bac613643f9f990ed95dab96c9c28cd2764d4

Maybe calling flush() on the handle helps. It's hard to fix if I can't reproduce it. Can you retry with the current git master and debug loglevel again?

abbbi commented 6 months ago

What's strange is that this behavior shouldn't happen, because qemu-nbd would fail if another process were still using the socket:

qemu-nbd --discard=unmap --format=qcow2 -x sda /tmp/test.qcow2 -k /tmp/test --fork
qemu-nbd: Failed to blk_new_open '/tmp/test.qcow2': Failed to get "write" lock
Is another process using the image [/tmp/test.qcow2]?

and the function which starts the qemu-nbd process for the second disk would catch this error. In the log it's clear that the process starts with exit code 0:

[2024-03-21 16:52:52] DEBUG libvirtnbdbackup.qemu.command command - run [main]: CMD: qemu-nbd --discard=unmap --format=qcow2 -x sdb /data/restored/foobar/foobar_d.qcow2 -k /var/tmp/virtnbdbackup.224619 --fork
[2024-03-21 16:52:52] DEBUG libvirtnbdbackup.qemu.command command - run [main]: Return code: 0

Is AppArmor active on the system? Maybe this is another AppArmor shenanigan...

One way to diagnose this would be to check, during the 20-second timeout while it retries:

INFO nbd client - connect [main]: Waiting for NBD Server, Retry: 0

whether the qemu-nbd process reported in the logfile is still running and whether it is reachable, for example by using nbdinfo to try to connect. The latest master now reports the PID correctly and also passes the --pid-file argument to the qemu-nbd process that is started.
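A minimal sketch of such a check, using the PID and socket path from the logfile quoted above (nbdinfo ships with libnbd; the last line only applies if AppArmor is in use):

# while the restore is still printing "Waiting for NBD Server, Retry: N":

# is the qemu-nbd process from the log still alive?
ps -fp 224816

# is the export reachable on its socket?
nbdinfo 'nbd+unix:///sdb?socket=/var/tmp/virtnbdbackup.224619'

# if AppArmor is suspected, look for recent denials
dmesg | grep -i -e apparmor -e denied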

jammerhund commented 6 months ago

Thank you very much! With the latest master, the restore of both disks works.

abbbi commented 6 months ago

thanks for confirming!