wnbd driver and qemu-nbd - DriveIO lockup and corruption

cloudbase / wnbd

Windows Ceph RBD NBD driver

GNU Lesser General Public License v2.1

60 stars 26 forks source link

wnbd driver and qemu-nbd - DriveIO lockup and corruption #63

Closed hgkamath closed 1 year ago

hgkamath commented 2 years ago

I was doing some testing on WIN10-21H2-19044-1415/WNBD-0.2.2-4-g10c1fbe qemu-6.2.0-rc4/ExFAT/qcow2/NTFS

For complete details, see https://gitlab.com/qemu-project/qemu/-/issues/727#note_780895802

qemu-nbd is used to make a virtual-disk residing on a disk on host available to the host as a drive via nbd protocol.
WNBD is used to attach the nbd to Windows-10 as a drive.
about 85GiB of Data is copied into the drive.
A VHDX virtual-disk will choke during the copy. Copy fails midway and host becomes unresponsive
A qcow2 virtual-disk will result in corrupted data

On Linux, only the VHDX format corrupts during expansion, while qcow2 and other virtual-disk formats seem fine. But on Windows, I found that even the qcow2 format too can corrupt during expansion (TEST E). For this reason, I think this is a separate distinct bug from the original and wnbd may also be a factor. The windows only bug is in addition to the qemu-only bug, which is also lurking in there.

The bug is reproducible. A script for synthetic data generation is provided in an earlier comment in that issue filing (link).

It is possible that this is a qemu-nbd-windows only bug, but then the wnbd driver is also involved here. The semi-official driver was test-signed and seemed to be still under development. I guess the bug needs to be squashed by troubleshooting and closing in on it from both sides.

The 5 test-cases can also be part of wnbd testing.

Q) Is there are recommended version of qemu-nbd that is known to work reliably with WNBD and the two together have been well tested?

petrutlucian94 commented 2 years ago

Hi,

Thank you for your bug report. From the link that you have provided, I understand that you've tested multiple nbd clients (wnbd as well as the Linux native nbd client), multiple qemu-nbd versions and image formats.

Since you're hitting the same issue with the Linux nbd client, I doubt this is a wnbd bug. I'd suggest narrowing down the problem, even taking nbd out of the picture completely. In your Qemu bug report, you're stating that some image formats are affected while some are not. You can verify this by directly attaching the images to Qemu vms, which would exercise the same Qemu block drivers, without actually using NBD.

My first assumption was there may be unflushed changes, but since you're unmounting the volumes before closing the nbd connection, that shouldn't be the case. It may be interesting to do a wnbd-client stats $mapping before removing the nbd connection though.

Q) Is there are recommended version of qemu-nbd that is known to work reliably with WNBD and the two together have been well tested?

wnbd provides a generic NBD client, the qemu-nbd version shouldn't really matter. In case of a protocol incompatibility, the connection would be dropped and the log would contain some meaningful error.

By the way, WNBD-6.12.39.253 isn't the actual wnbd version, you may use wnbd-client version to retrieve it.

Regards, Lucian

hgkamath commented 2 years ago

Thank-you for your reply.

Trying the qemu-VM from the inside is a good suggestion. qemu-system-x86_64 is very flaky under WHPX (windows hypervisor HyperV), perhaps I should consider setting it off. The qemu-vm is more reliable with Intel-haxm hypervisor. qemu tools like qemu-img and qemu-nbd are smaller independent binaries that do not engage qemu's virtualization/emulation, just the qemu-block framework. I will try it to do it some time, and see if corruption happens that way.

In Windows qemu-setup (https://qemu.weilnetz.de/), the qemu-nbd.exe binary started getting built and included with the shipped qemu-setup since 2021-05-05, qemu-version 6.0. So this windows qemu nbd-server is a relatively recent addition to the set of available nbd peer/server/clients. qemu-w64-setup-20210505.exe

qemu-nbd is a local nbd-client. Both SSD and HDD are connected to mini-SATA. Hence read/write speeds are much higher than what one may get over the network. resmon.exe claims 70 million bytes per sec though the qemu-nbd.exe process. Ordinary network speeds are often lower than native bus speeds, even though they claim 1gbps ethernet, consumer network traffic seldom really reaches the speed of the system-pci-express-bus. At this native machine speed, there is bound to be bottlenecks and buffer-fill/overfill.

The xcopy process is reading files from HDD, at HDD read-speeds
The xcopy process is writing to K: (mapped drive)
The Windows-NTFS driver gives the io-writes to K: to the underlying disk which is managed by the WNBD SCSI Virtual Adapter
The WNBD SCSI Virtual Adapter sends these to the qemu-nbd NBD server, using the TCP/IP sockets and network-stack
The qemu-nbd has to ask the underlying Windows/ExFAT on SSD to preallocate blocks to the virtual-disk file
The Windows-ExFAT driver has to increase block allocation of the virtual-disk-file on ExFAT-volume, and update FS metadata on SSD
The qemu-nbd has to write the received blocks to the virtual-disk-file on ExFAT-volume.
The Windows-ExFAT driver writes the blocks to the underlying SSD disk on SATA AHCI controller on the PCIE root complex.
Every in/out of disk reads, is being page cached. I have 16GB of RAM. Page cache file is on a SSD partition, exclusively used for page-cache (G:). G: is an ExFAT volume that slightly over 16GB in size. Page cache is configured to be fixed-size of 16GB=16384MB. Windows would probably warn if it was running low on page-cache, no such warnings, and none of the usual signs of low page cache thrashing. In fact, because the machine is disk-io blocked, no thrashing is occurring.

Description of what is meant by when the machine is stuck: When the stuck happens, it is lucky that some open GUI processes work/respond at all. I keep a few extra privileged cmd.exe pwsh.exe windows open. Any other process that makes any disk access goes into a presumably io-blocked "not responding state". It is not possible to invoke new programs. Hence, one must be very careful with the processes that do respond. As long as no disk access is made, processes such as resmon.exe, taskmgr.exe continue to refresh and display. GUI components like menus, selections work. Text in notepad.exe can be editted but if attempted to be saved will cause it to go into not-responding. CPU usage is cool, typing, menu GUI, mouse movements are normal. Those on-screen applications that work are responsive. The only program that works is the cmd/pwsh windows. Attempting to change directory with cd will also make the command shell become stuck. It does not matter which hard-disk is being accessed, SSD1, HDD1, HDD2 etc.

I think the reason why wnbd_client.exe can be made to run from the cmd.exe prompt is because its dll, exe images are memory cached, and so shell is not disk-io blocked. However, other than stat, uninstall driver other commands have a very very long time of getting a response from the wnbd-driver.

qemu, qcow2 and kernel-nbd are well tested and work pretty rock-solid-ly under Linux. This is the other main reason why this issue is most likely a separate bug (bug2-63) as its effect is seen in different pathway, and affect any type of virtual-disk-format including qcow2. In the original bug (bug1-727), only the virtual-disk-format VHDX is affected, everything else is fine.

Uninstalling and reinstalling the driver is the only way to unstuck the situation. At this point, all not-responding processes/apps come back alive. Until I discovered this, I was only force shutting down laptop. the following don't work ctrl-C or taskmgr-end-task on qemu-nbd or xcopy, attempt wnd-client unmap. The taskmgr will accept the end-task the first time on the io-blocked not-responding process, but will be unable to truly terminate it. If the taskmgr end-task attempted again on a previously end-tasked process, a popup dialog "action not allowed" is shown.

E:\vstorage\nbd\wnbd_client>wnbd-client.exe uninstall-driver
libwnbd.dll!RemoveAllWnbdDisks INFO Hard removing WNBD disk: gkpics01
libwnbd.dll!RemoveWnbdAdapterDevice INFO Removing WNBD adapter device. Hardware id: root\wnbd. Class GUID: {4D36E97B-E325-11CE-BFC1-08002BE10318}
libwnbd.dll!CleanDrivers INFO Removing WNBD driver: oem30.inf

E:\vstorage\nbd\wnbd_client>wnbd-client.exe install-driver E:\vstorage\nbd\wnbd_driver\wnbd.inf
libwnbd.dll!WnbdInstallDriver INFO Cleaned up stale WNBD adapters.
libwnbd.dll!InstallDriver INFO Installing WNBD driver: E:\vstorage\nbd\wnbd_driver\wnbd.inf
libwnbd.dll!CreateWnbdAdapter INFO Creating WNBD adapter. Class: SCSIAdapter. Hardware id: root\wnbd.

This also hints to me that it is the wnbd driver that is freezing the disk-accesses of the system. The desktop is logged-in as an unprivileged non-administrative user. The qemu-nbd.exe is an unprivileged process started/run in an unprivileged shell. Under normal use, while running, it is easily stopped with ctrl-C or taskmgr. It can't really do anything to bring the whole system down.

Summary of why bug2-63 is separate from bug1-727 and why it might be WNBD and not qemu-nbd.exe

any virtual-disk format is affected (even qcow2, not just VHDX)
The whole local-machine with unrelated process is stuck, disk-io frozen (an unprivileged process cannot do this)
uninstalling, installing the driver using the wnbd-client, unstucks the system
qemu-nbd is a local nbd-server. Other previous use-cases of wnbd may have been true remote network-based NBD-servers, that may not push wnbd-driver and local machine windows-kernel to block-transfer speed limits.

Miscellaneous troubleshooting logs

PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe list-opt
LogLevel           : 1 (Default: 1)
NewMappingsAllowed : true (Default: true)
EtwLoggingEnabled  : true (Default: true)
WppLoggingEnabled  : false (Default: false)
DbgPrintEnabled    : true (Default: true)
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe set-opt LogLevel 4
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe list-opt
LogLevel           : 4 (Default: 1)
NewMappingsAllowed : true (Default: true)
EtwLoggingEnabled  : true (Default: true)
WppLoggingEnabled  : false (Default: false)
DbgPrintEnabled    : true (Default: true)

PS E:\vstorage\nbd\wnbd_client> logman delete "WNBD_tracing_session"
The command completed successfully.
PS E:\vstorage\nbd\wnbd_client> logman create trace "WNBD_tracing_session" -p "{E35EAF83-0F07-418A-907C-141CD200F252}" "0xffffffff" "0xff" -o e:\tmpq\TraceFile_2.etl -rt

still trying to capture some logs.

Miscellaneous trouble shooting

qemu-w64-setup-20210505.exe (6.0.0)
qemu-w64-setup-20210619.exe (6.0.0) <- i used this
qemu-w64-setup-20211215.exe (6.2.0) <- earlier I had tried this

Below are the files one needs, and one can just extract these exe/dlls from the qemu-setup and cephpacific-setup without doing a full install

PS E:\vstorage\nbd\qemu-nbd-6.0> (Get-ChildItem).Name
iconv.dll
libbz2-1.dll
libcurl-4.dll
libffi-6.dll
libglib-2.0-0.dll
libgmp-10.dll
libgnutls-30.dll
libhogweed-4.dll
libidn2-0.dll
libintl-8.dll
libnettle-6.dll
libnghttp2-14.dll
libp11-kit-0.dll
libpcre-1.dll
libssp-0.dll
libtasn1-6.dll
libunistring-2.dll
libwinpthread-1.dll
libzstd-1.dll
qemu-nbd.exe
zlib1.dll

PS E:\vstorage\nbd\wnbd_client> (Get-ChildItem).Name
libwnbd.dll
wnbd-client.exe

PS E:\vstorage\nbd\wnbd_driver> (Get-ChildItem).Name
wnbd.cat
wnbd.inf
wnbd.sys

below are some wnbdclient stat captures. Usually the numbers OutstandingIO, pendingSubmittedIO and TotalReceivedIO add up. But start diverging after the stuck happens. Thereafter, the wnbd-client collecting stat responses is not longer quick but takes quiet some time.

PS E:\vstorage\nbd\wnbd_client> ./wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 4645
TotalSubmittedIORequests       : 4645
TotalReceivedIOReplies         : 4645
UnsubmittedIORequests          : 0
PendingSubmittedIORequests     : 0
AbortedSubmittedIORequests     : 0
AbortedUnsubmittedIORequests   : 0
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 0

When stuck happens,

resmon.exe shows that qemu-nbd.exe write-thoughput drops from 70Mbps to almost nothing and disappearing from the disk-IO using process list.
in wnbd_clent stats, the aborted counts start to increase

successive wnbd-client stats, take a much longer time to respond.

Just when the-stuck has kicked in 
E:\vstorage\nbd\wnbd_client>wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 8689
TotalSubmittedIORequests       : 8689
TotalReceivedIOReplies         : 8677
UnsubmittedIORequests          : 0
PendingSubmittedIORequests     : 12
AbortedSubmittedIORequests     : 6
AbortedUnsubmittedIORequests   : 0
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 6

The below file (contains random bytes) in the xcopy of the synthetic generated data set is when the system become stuck

PS C:\Users\gana> pwsh -Command { $sdate=$(date) ; echo "$sdate" ; xcopy /E E:\test\sgdata K:\ ; echo "$sdate" ; date } 
12/18/2021 15:41:18
E:\test\sgdata\photos001\find.CHECKSUM
:
E:\test\sgdata\photos001\D0000_88114831\File_0000.binE:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0000.bin

E:\vstorage\nbd\wnbd_client>dir H:
 Volume in drive H is CPERF0
 Volume Serial Number is F196-DB9E

 Directory of H:\

12/18/2021  03:44 PM     6,194,987,008 gkpics01.vhdx
               1 File(s)  6,194,987,008 bytes
               0 Dir(s)  149,989,228,544 bytes free

E:\vstorage\nbd\wnbd_client>wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 8706
TotalSubmittedIORequests       : 8695
TotalReceivedIOReplies         : 8677
UnsubmittedIORequests          : 11
PendingSubmittedIORequests     : 18
AbortedSubmittedIORequests     : 19
AbortedUnsubmittedIORequests   : 9
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 1

E:\vstorage\nbd\wnbd_client>wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 8841
TotalSubmittedIORequests       : 8695
TotalReceivedIOReplies         : 8677
UnsubmittedIORequests          : 146
PendingSubmittedIORequests     : 18
AbortedSubmittedIORequests     : 19
AbortedUnsubmittedIORequests   : 144
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 1

hgkamath commented 2 years ago

In the below test I use a qemu VM to do copy from inside VM.

test1
- test: copy sgdata onto qcow2 disk
- result: VHDX corrupted
- comment: nothing surprising here
test 2
- test: copy sgdata onto qcow2 disk
- result: QCOW2 corrupted
- comment: surprising, but no system DriveIO lockup. I will need to test this on linux as well

Preparation

I use alpine linux, (alpine-standard-3.15.0-x86_64.iso) because that was the only live-cd as of today with kernel-5.15.4 which has the paragon ntfs3 driver and boots into multi-user non-graphical console login.
Synthetic generated data is copied into E:\test\sgdata.vhdx so that it can be attached as a drive to the VM
Fresh empty gkpics01.vhdx is placed in H:\
Create empty qcow2: qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 99723771904

After booting Alpine Linux 3.15

setup-alpine: us, us, localhost, eth0, dhcp, no-manual-nw, timezone, noproxy, select repository, openssh, nodrives, nocfg, default apkcache
apk add gzip rsync e2fsprogs sgdisk gptfdisk
mkdir /mnt/a /mnt/b
mount partition of src drive (sgdata.vhdx): mount -t ntfs3 /dev/sdb2 /mnt/b
mount partition of dst drive
- for vhdx: mount -t ntfs3 /dev/sda2 /mnt/a
- for qcow2:
  - create partition /dev/sda1 using gdisk
  - mkfs -t ext4 -L fs_gkpics_01 /dev/sda1
  - mount -t ext4 /dev/sda1 /mnt/a

VHDX TEST

In cmd prompt: 
C:\Windows> copy E:\vstorage\gkpics01_new.vhdx H:\gkpics01.vhdx

C:\Users\gana>dir E:\vstorage\gkpics01_new.vhdx
 Volume in drive E is 15KJ_Images
 Volume Serial Number is 1EA6-2771

 Directory of E:\vstorage

11/12/2021  01:21 PM       205,520,896 gkpics01_new.vhdx
               1 File(s)    205,520,896 bytes
               0 Dir(s)  360,436,940,800 bytes free

PS E:\vstorage\nbd\qemu-nbd-6.0> & C:\vol\scoop_01\scoopg\shims\qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot d -cdrom E:\transcend\alpine-standard-3.15.0-x86_64.iso -hda H:\gkpics01.vhdx -hdb E:\test\sgdata.vhdx  -display gtk -vga virtio -rtc base=utc -net user -device "virtio-net,netdev=vmnic" -netdev "user,id=vmnic,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15"
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: GdkPixbuf: Cannot open pixbuf loader module file 'C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache': No such file or directory

This likely means that your installation is broken.
Try running the command
  gdk-pixbuf-query-loaders > C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache
to make things work again for the time being.
Windows Hypervisor Platform accelerator is operational
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: hub 0 with no nics

Inside VM, prepare machine as mentioned earlier, setup-alpine .. mount partitions

localhost:/mnt# (sdate=`date` ; cd /mnt/b ; rsync -avH ./photos001 /mnt/a ; echo "$sdate" ; date ) 
localhost:/mnt# ls -l /mnt/a/photos001

The VHDX fails badly, even an ls -l did not succeed. Reboot was not required. nothing surprising here.

QCOW2 TEST

E:\vstorage\nbd\wnbd_client>C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 99723771904
Formatting 'H:\gkpics01.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=99723771904 lazy_refcounts=off refcount_bits=16

E:\vstorage\nbd\wnbd_client>dir H:\gkpics01.qcow2
 Volume in drive H is CPERF0
 Volume Serial Number is F196-DB9E

 Directory of H:\

12/19/2021  07:52 PM           198,144 gkpics01.qcow2
               1 File(s)        198,144 bytes
               0 Dir(s)  156,183,953,408 bytes free

PS E:\vstorage\nbd\qemu-nbd-6.0> & C:\vol\scoop_01\scoopg\shims\qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot d -cdrom E:\transcend\alpine-standard-3.15.0-x86_64.iso -hda H:\gkpics01.qcow2 -hdb E:\test\sgdata.vhdx  -display gtk -vga virtio -rtc base=utc -net user -device "virtio-net,netdev=vmnic" -netdev "user,id=vmnic,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15"
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: GdkPixbuf: Cannot open pixbuf loader module file 'C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache': No such file or directory

This likely means that your installation is broken.
Try running the command
  gdk-pixbuf-query-loaders > C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache
to make things work again for the time being.
Windows Hypervisor Platform accelerator is operational
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: hub 0 with no nics

Inside VM, prepare machine as mentioned earlier, setup-alpine .. mount partitions rsync copy

localhost:/mnt# (sdate=`date` ; cd /mnt/b ; rsync -avH ./photos001 /mnt/a ; echo "$sdate" ; date ) 
:
Sent 90,963,665,782 bytes  recevied 298,855 bytes  60,622,431.61 bytes/sec
total size is 90,940,594,995  speedup is 1.00
Sun Dec 19 20:19:26 IST 2021
Sun Dec 19 20:44:26 IST 2021

Verifying source drive

localhost:/mnt# (sdate=`date` ; cd /mnt/b/photos001 ; sha256sum -c ./find.CHECKSUM | tee /tmp/fb ; echo "$sdate" ; date ) 
:
Sun Dec 19 21:28:09 IST 2021
Sun Dec 19 21:48:35 IST 2021
localhost:/mnt# cat /tmp/fb | grep -v ": OK" | wc -l  
0

Verifying the Destination drive containing with the rsynced/copied sgdata

localhost:/mnt# (sdate=`date` ; cd /mnt/a/photos001 ;  sha256sum -c ./find.CHECKSUM | tee /tmp/fa ; echo "$sdate" ; date ) 
:
/D0039_1923a8ec/File_0002.bin: OK
Sun Dec 19 20:47:53 IST 2021
Sun Dec 19 21:07:39 IST 2021
localhost:/mnt# cat /tmp/fa | grep -v ": OK" | wc -l  
202

!! QCOW2 is corrupted

It is true that there are bugs in both code pathway of VHDX and qcow2 qemu virtual-disks, which show up even without the code-pathway of nbd being involved. However, it is also true that no system-stuck lockup happened.
The system lockup is the doing of the wnbd driver. That is why I think wnbd itself does also have a bug.

The nbd protocol, and the wnbd kernel driver need to be robust/resilient towards failures in the nbd-peer. So if nbd peer is screwing up, at worst it should disconnect drive backend connection, or refuse service, or try re-transmission, and if it is re-transmission, it shouldn't get stuck in an infinite loop, locking up the whole system

The system drive-io lockup is reminiscent of situations like bad hard-drive bad-sector errors, or cdrom-read errors, where-in the windows-system get stuck trying to do a retry without ever succeeding, or taking a long time to move on. I don't think the my SSD drive really has problem blocks. Data integrity is fine when large virtual-disks are copied onto and off. So because qemu-nbd is screwing up, wnbd is locking up.

Summary posted to bug1-727 comment https://gitlab.com/qemu-project/qemu/-/issues/727#note_789170496

petrutlucian94 commented 2 years ago

So in your case, some IO operations are hanging on the nbd server side. Windows (storport.sys to be precise) will cancel pending IO requests after about 30s, in which case that AbortedSubmittedIORequests counter will be increased.

It's true that the Windows explorer usually hangs if the drives become unresponsive, but that's not WNBD's fault. It's the same with directly attached disk drives or SMB shares. WNBD simply receives IO requests from Storport and forwards them to an NBD server (or to a registered userspace app that uses libwnbd, but that's a different topic).

Storport miniport drivers, such as WNBD, are not supposed to retry canceled operations. At best, we may try to detect IO deadlocks and disconnect the disk, but that should be configurable.

A few other tests that you could make:

retry your scenario on a Linux host
- use a qemu vm with directly attached image files
- use a qemu vm with loopback devices attached
- connect to a qemu-nbd disk using the linux client
run a nbd server on Linux and connect to it using wnbd
- use nbd-wnbd
- use the standalone nbd server with a raw image file https://github.com/NetworkBlockDevice/nbd#using-nbd

petrutlucian94 commented 2 years ago

I'm going to close this issue since it's not a WNBD driver bug, but rather a qemu-nbd or environment issue.

hgkamath commented 2 years ago

sorry for not updating, This is on my mind, but didn't yet got to do it. I'm still yet to test qemu-7 , which just got recently released. qemu-project also released a new binary qemu-storage-daemon.exe which also can act like a nbd server, amongst other things.

Just some updates. The kernel-5.15.x series upto 5.15.24, had ntfs3 read-side bugs, and upto 5.17.x series there were folio/mmotm feature commits and I had taken a pause until some other qemu/kernel bugs sorted out, so that there is less ambiguity and compounding. I don't think the wnbd driver is in the clear yet. But it is true, that some of the negative test-results above, that I did, after taking your very-good-suggestion of testing wholly inside a qemu-vm, were on account of these bugs.

I'll reopen this when I get to test it and it seems more sure that it isn't due to a qemu/linux-guest-kernel/linux-host-kernel issue

Also on a side note, I wish cloudbase released a regular-signed wnbd driver, which you should, if you think it is bug free. On my laptop, in order to use the testsigned driver, I need to use the 1 time disable device-driver-signature-enforcement special boot, or use the testsign mode which the HP-UEFI requires me to disable secureboot. The test-mode also puts a ugly wallpaper watermark. This hassle is what creates reluctance to test. Suppose things work well, I am not sure if I would go down a decision of mounting drives via wnbd this way on a regular basis, if it requires me to permanently turn off secureboot.

petrutlucian94 commented 2 years ago

Thanks for feedback. We're currently shipping a signed version of the driver with the Ceph msi installer: https://cloudbase.it/ceph-for-windows/, I hope that helps. In the future, we may consider a standalone WNBD installer.

hgkamath commented 2 years ago

I think the ceph-msi-installer ceph_pacific_beta.msi which is still beta is also just test-signed. (am i wrong?) I had used the same before also.

Open device manager, right click any wnbd SCSI virtual adapter and uninstall device
Open admin cmd prompt
(only if updating driver) Uninstall ceph from programs and features
Ensure that there is no prior certificate under trusted publisher from Cloudbase Solutions Srl . a. Open cmd certmgr b. Delete cloudbase certificate under trusted-publishers/certificates
(only if updating driver) Reinstall ceph
issue command : .\wnbd-client.exe install-driver --filename 'C:\Program Files\Ceph\driver\wnbd.inf'
Dialog prompt comes up: "would you like to install this device software with name and publisher a. ensure checkbox always trust software from Cloudbase Solutions Srl is checked b. click install
Driver is installed, and wnbd driver shows up under storage controllers in device manager
Close certmgr.exe, wait a minute and again reopen certmgr.exe. a. verify that cloudbase certificate is again present under trusted-publishers/certificates
b. Double click/Open the certificate. Verify - . 1. In general tab: The digital signature is OK , certificate validity is from 8/13/2021 to 8/12/2024 . 2. In details tab: Key usage under has yellow triangle for Key Usage : Digital Signature (80)
However, driver in device mgr has yellow triangle. and right click property gives code 52 Windows cannot verify the digital signature for the drivers required for this device. A recent hardware or software change might have installed a file that is signed incorrectly or damaged, or that might be malicious software from an unknown source. (Code 52)

PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe -v
wnbd-client.exe: 0.2.2-4-g10c1fbe
libwnbd.dll: 0.2.2-4-g10c1fbe
libwnbd.dll!WnbdOpenAdapterEx ERROR No WNBD adapter found. Please make sure that the driver is installed.

driverver from wnbd.inf: DriverVer = 04/28/2022,6.15.27.572

petrutlucian94 commented 2 years ago

Right, it's signed using a Cloudbase certificate and not by Microsoft, which is why secure boot will have to remain disabled. However, you may disable the "testsigning" setting (allowing self signed drivers) when using the Ceph msi.

We decided to keep the "beta" label until we implement CI testing and get more user feedback. The good news is that a Windows Ceph CI job has been added recently [1] and that we got quite a few users already, which provide valuable feedback (as you did already).

A stable driver release will also go through the Microsoft driver attestation and signing process. Unfortunately, I can't provide an ETA but I hope that this will be ready by the next Ceph release.

[1] https://jenkins.ceph.com/job/ceph-windows-pull-requests/6850/

Jazz9 commented 2 years ago

I have been chasing the same issue for a few months now. 16.2.7 ceph and windows 2019. I see this across 4 separate clusters. The larger the cluster the worse it seems to be ( ranging from 50 -340 osds per cluster )

We had a firmware bug where OSDs restarted randomly and this would trigger the windows RBD volume to lock up and never return. A restart of the windows server would sometimes cause refs to corrupt and go RAW. We have implemented snapshots to try mitigate this.

A planned reboot of a ceph node doesn't seem to cause the issue. Only a OSD assert.

The fault is characterized by PendingSubmittedIORequests going up but not aborts.

We are open to paying for support to correct this issue.

petrutlucian94 commented 2 years ago

@Jazz9 thanks for feedback. The issue may be slightly different as the OP wasn't using RBD but actually NBD (wnbd can also act as a standalone nbd client) and it affected other NBD clients as well.

There might be some problem around aborted or timed out requests, so I think it's worth reopening this issue. The bug may also reside on the Ceph side (rbd-wnbd leverages wnbd), which may not be able to gracefully handle lost OSD connections (at least on Windows). Just checking: have you encountered similar issues with Linux RBD clients?

As a side note, there is a recent Ceph fix that replaces select() with poll() on Windows in order to avoid fd limitations, which can be a problem when having a large amount of OSDs: https://github.com/ceph/ceph/pull/46525.

We'll definitely need to investigate this issue. Unfortunately, I'm no longer actively working on this project (at least for now), so I can't provide an ETA. One of my colleagues may be able to reproduce it.

If needed, you may get in touch with our support team by filling this form: https://cloudbase.it/about/#contact

petrutlucian94 commented 2 years ago

We found the issue. If all the IO dispatch workers are waiting for RBD callbacks indefinitely, the userspace (libwnbd) may never get notified of mapping removals. We'll need to improve the way in which the driver informs the IO dispatcher that the disk has been removed (e.g. by using a seperate control thread or some disconnect events).

We'll also have to avoid waiting for IO operations indefinitely on the rbd side (e.g. maybe double check our timeouts).

Jazz9 commented 2 years ago

Great! Normally windows is really good at waiting for IO.

I would expect that the rbd side would need to timeout and retry the IO operation ( 20 sec timeout? ). That would assist if the OSD had restarted or the PG peered to another OSD.

Do you have a rough ETA on the fix?

petrutlucian94 commented 2 years ago

There might be more to it since all rbd-wnbd IO callbacks are actually async and should return immediately (sometimes they don't and we see the hang). We even saw librbd::Image::close hang, the async version is likely going to hang as well.

It's a bit tricky to reproduce since on our env it only happens once per ~1000 rbd mappings. My colleague's been trying out pacific, quincy and the master ceph branch to see if it's a regression and collect some logs/process dumps.

While there are a few possible mitigations, we'll need to look into the hanging async calls which is probably the root cause of the issue. Unfortunately I can't provide an ETA, I'll try to assist the colleagues that are currently maintaining the project.

petrutlucian94 commented 2 years ago

A small update: looks like we're affected by this winpthreads bug that causes shared mutexes to deadlock: https://github.com/msys2/MINGW-packages/issues/3319. Trace: https://pastebin.com/raw/i3jpTyS3

Ideally, we should get winpthreads fixed. The alternative would be to use a different shared mutex implementation. We did consider building Ceph using Clang or MSVC but that would require a large amount of code changes on the Ceph side.

lhmouse commented 2 years ago

Hello, mingw-w64 developer speaking. I have made some efforts to replace winpthreads due to its lack of maintainability. Do you have any interest in https://gcc-mcf.lhmouse.com/? The mcfgthread library doesn't implement RW locks, but libstdc++ has an alternative implementation using condition variables, so shared_mutex also works there.

petrutlucian94 commented 2 years ago

hi @lhmouse, mcfgthread seems interesting, thanks a lot for the suggestion.

We're currently cross-compiling Ceph using Linux, so we'll have to build mcfgthread and also gcc with the custom patches on top. Would you recommend using a specific gcc version? I see that there are some wip patches for gcc 12: https://github.com/lhmouse/MINGW-packages/tree/master/mingw-w64-gcc

Thank you for your work on mingw, btw :).

lhmouse commented 2 years ago

hi @lhmouse, mcfgthread seems interesting, thanks a lot for the suggestion.

We're currently cross-compiling Ceph using Linux, so we'll have to build mcfgthread and also gcc with the custom patches on top. Would you recommend using a specific gcc version? I see that there are some wip patches for gcc 12: https://github.com/lhmouse/MINGW-packages/tree/master/mingw-w64-gcc

Only the 9000- and 9001- patches are necessary; the later is required for building libgfortran, otherwise it can be dropped.

The mcfgthread library provides its own __cxa_atexit() and __cxa_finalize() implementations, which requires modification of the CRT. A complete set of patches (for both GCC and mingw-w64) is available at https://github.com/lhmouse/mcfgthread/tree/master/patches; please refer to the README.md file for details.

The patch for GCC should apply to GCC 11 and 12 branches cleanly. As far as I remember there haven't been any changes since GCC 9; probably it applies to 9 and 10 branches as well.

I built my personal toolchains using the standard cross-compilation procedure, which was necessary to replace the default CRT in MSYS2. It should also work when cross-compiling from Linux, I think.

Thank you for your work on mingw, btw :).

My pleasure.

Jazz9 commented 2 years ago

It looks like you are close to a resolution on this issue. Do you have a ETA on the release?

petrutlucian94 commented 2 years ago

It looks like you are close to a resolution on this issue. Do you have a ETA on the release?

the fix got approved, it's in the Ceph QA queue right now. I expect it to be merged in a couple of weeks. Afterwards, I'll backport it to Quincy and Pacific.

Jazz9 commented 2 years ago

@petrutlucian94 - Did you have an update on the backport / availability of this fix?

petrutlucian94 commented 2 years ago

@petrutlucian94 - Did you have an update on the backport / availability of this fix?

hi, the fix has been backported to pacific and quincy. there are a few more patches that we'd like to include before releasing a new MSI.

hgkamath commented 2 years ago

RC release of quincy 17.2.4 is due shortly 20220919 https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/DUWEJXQH4XRU5UBQOH7IZXVJVCAMOMUJ/ I hope the task of making a signed windows driver for the final-release doesn't slip though the cracks.

petrutlucian94 commented 2 years ago

RC release of quincy 17.2.4 is due shortly 20220919 https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/DUWEJXQH4XRU5UBQOH7IZXVJVCAMOMUJ/ I hope the task of making a signed windows driver for the final-release doesn't slip though the cracks.

actually, we've just prepared Pacific and Quincy releases that include some of the fixes that didn't merge upstream yet. we've just finished testing it, so expect a new msi the following days.

for what is worth, the wnbd driver remains mostly the same, all the fixes that we had were actually targeting ceph, not the driver.

petrutlucian94 commented 2 years ago

The Ceph MSIs have been updated: https://cloudbase.it/ceph-for-windows/

I think we can close this issue now. Just to clarify, we ended up using the same GitHub issue for two (or more) different problems:

@hgkamath's initial issue involving the nbd client, completely unrelated to Ceph. since the same issue was happening with other nbd clients, it seemed like an environment or nbd server issue
@Jazz9's issue regarding hanging RBD IO. we found out that it was caused by a faulty mingw shared mutex implementation, which also affected other async operations. switching to boost::shared_mutex fixed the issue. This simple fix was backported upstream to pacific and queens and new MSIs have been released.
@hgkamath's concern about driver signing - the driver is currently signed by Cloudbase, not Microsoft. This means that secure boot has to be disabled. Going through the Microsoft certification and signing process is not a top priority for us right now. This process usually takes weeks and has to be repeated whenever a new version is released. In fact, lots of device vendors provide an up to date driver signed by them and a significantly out of date one signed by Microsoft, simply because of the tedious process.

hgkamath commented 2 years ago

I'm gonna test the latest beta ceh-quincy-17.2.4 with qemu-7.1 out of curiosity. Disabling secure-boot may be deterrent for me to long term transition to qemu<->nbd<->wnbd-nbd-driver<->diskmgmt

Please, name some of the device vendors (co.) and their corresponding devices (model no), that have a nbd driver. I suspect such strategy by them could work, if the base case covers 99% of all user needs. Those users that want to live on edge with latest bleeding edge non-mainstream features, will sort of be the type who know what they are doing, entertain risk and disable secureboot. In this case we are yet to determine if this wnbd-nbd driver, covers as a base case reliable nbd driver.

Btw, I think qemu can also provide a rbd device, I think. would I have better luck with qemu<->rbd<->wnbd-rbd-driver<->diskmgmt

petrutlucian94 commented 2 years ago

Please, name some of the device vendors (co.) and their corresponding devices (model no), that have a wnbd driver.

Can't remember if it was a chelsio or a cisco network card driver - the Microsoft signed driver was about 2 years behind the vendor signed driver, most probably because of the same reason. It definitely wasn't a wnbd driver, just their network card driver. My point was that other vendors are affected by this as well and prefer not to go through the Microsoft signing process.

petrutlucian94 commented 2 years ago

Btw, I think qemu can also provide a rbd device, I think. would I have better luck with qemu<->rbd<->wnbd-rbd-driver<->diskmgmt

at that point you can just take qemu-nbd out of the picture and use rbd. it depends on your use case.

hgkamath commented 2 years ago

20220926 Changes that may be in effect since when tests were last done and now

wnbd driver is still at 0.2.2, though ceph release is changed from Pacific to Quincy, build number is changed from 3 to 11
linux kernel was 5.15 back then, it is passed 5.18 now, kernel features like mmotm/folios have settled
The new paragon ntfs3 driver of the linux kernel, first introduced in 5.15 has had many bug fixes
qemu is changed from 6.0 to 7.1. some qemu-general-bugs, some qemu-Hyper-V bugs, some qemu-nbd-bugs and some qemu-block-driver-bugs have been fixed allowing for qemu-VM stability

Haswell i7-4700QM 
WinOS, qemu-VM, Test-image all on SSD, sgdata (synthetic generated data) on HDD

Windows-Home
winver: 21H2 (OS Build: 19044.2075) 

Fedora 36 Workstation
Linux fedora 5.18.17-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Aug 11 14:36:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Ceph-Quincy-17.2.4 20220920
E:\vstorage\nbd>E:\vstorage\nbd\wnbd_client\wnbd-client.exe -v
wnbd-client.exe: 0.2.2-11-g3dbec5e
libwnbd.dll: 0.2.2-11-g3dbec5e
wnbd.sys: 0.2.2-11-g3dbec5e

E:\vstorage\nbd>qemu-nbd --version
qemu-nbd 7.1.0 (v7.1.0-11925-g4ec481870e-dirty)
Written by Anthony Liguori.

E:\vstorage\nbd>C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-storage-daemon.exe --version
qemu-storage-daemon version 7.1.0 (v7.1.0-11925-g4ec481870e-dirty)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers

TESTS

Test-01 :

Description:
- dst=rawimg-fixed, data-copy and verify, inside qemu-VM (no wnbd control-case)
- rawimg single partition, ntfs3 formatted inside qemu-VM
- Win10-21H2-1904.2075/whpx/qemu-7.1/rawimg/Fedora-36/kernel-5.18.17-200
- commandline rsync: src=attached-vhdx/windows-formatted-ntfs/ntfs3 dst=rawimg-fixed/single-partition/ntfs3
Duration: copy=33min, verify=16min
Result: success

Test-02 :

Description:
- dst=rawimg-fixed, data-copy and verify, using qemu-nbd, wnbd driver and Windows-explorer-GUI
- rawimg single partition, ntfs3 formatted inside qemu-VM
- Win10-21H2-1904.2075/qemu-nbd-7.1/rawimg/wnbd-0.2.2-11-g3dbec5e
- Windows-Explorer Gui Copy: src=E:\test\sgdata dst=K:(wnbd/qemu-nbd/rawimg-fixed/single-partition/ntfs3)
Duration: not well timed copy=~25 min , verify=15 min
Result: success

Test-03 :

Description:
- dst=rawimg-fixed, data-copy and verify, using qemu-nbd, wnbd driver and XCOPY
- rawimg, intialized and ntfs formatted by diskmgmt, which creates a small 100Mb win-reserved-partition before the ntfs partition.
- Win10-21H2-1904.2075/qemu-nbd-7.1/rawimg/wnbd-0.2.2-11-g3dbec5e
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(wnbd/qemu-nbd/rawimg-fixed/single-partition/ntfs3)
Duration: copy=21min, verify=12min
Result: success

Test-04 :

Description:
- dst=rawimg-fixed, data-copy and verify, using qemu-storage-daemon, wnbd driver and XCOPY
- rawimg, intialized and ntfs formatted by diskmgmt, which creates a small 100Mb win-reserved-partition before the ntfs partition.
- Win10-21H2-1904.2075/qemu-storage-daemon-7.1/rawimg/wnbd-0.2.2-11-g3dbec5e
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(wnbd/qemu-storage-daemon/rawimg-fixed/single-partition/ntfs3)
Duration: copy=21min, verify=13min
Result: success

Test-05 :

Description:
- dst=qcow2-fixed, data-copy and verify, inside qemu-VM (no wnbd control-case)
- qcow2-fixed single partition, ntfs3 formatted inside qemu-VM
- Win10-21H2-1904.2075/whpx/qemu-7.1/qcow2-fixed/Fedora-36/kernel-5.18.17-200
- commandline rsync: src=attached-vhdx/windows-formatted-ntfs/ntfs3 dst=qcow2-fixed/single-partition/ntfs3
Duration: copy=33min, verify=16min
Result: success

Test-06 :

Description:
- dst=qcow2-fixed, data-copy and verify, using qemu-storage-daemon, wnbd driver and XCOPY
- qcow2-fixed, intialized and ntfs formatted by diskmgmt
- Win10-21H2-1904.2075/qemu-storage-daemon-7.1/rawimg/wnbd-0.2.2-11-g3dbec5e
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(wnbd/qemu-storage-daemon/qcow2-fixed/single-partition/ntfs3)
Duration: copy=22min, verify=12min
Result: success

Test-07 :

Description:
- dst=qcow2-dyn, data-copy and verify, inside qemu-VM (no wnbd control-case)
- qcow2-dyn single partition, ntfs3 formatted inside qemu-VM
- Win10-21H2-1904.2075/whpx/qemu-7.1/qcow2-fixed/Fedora-36/kernel-5.18.17-200
- commandline rsync: src=attached-vhdx/windows-formatted-ntfs/ntfs3 dst=qcow2-dyn/single-partition/ntfs3
Duration: copy=? min, verify=? min
Result: ? https://gitlab.com/qemu-project/qemu/-/issues/814

Test-08 :

Description:
- dst=qcow2-dyn, data-copy and verify, using qemu-storage-daemon, wnbd driver and XCOPY
- qcow2-dyn, intialized and ntfs formatted by diskmgmt
- Win10-21H2-1904.2075/qemu-storage-daemon-7.1/rawimg/wnbd-0.2.2-11-g3dbec5e
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(wnbd/qemu-storage-daemon/qcow2-dyn/single-partition/ntfs3)
Duration: copy=? min, verify=? min
Result: ? https://gitlab.com/qemu-project/qemu/-/issues/814

Test-09 :

Description:
- dst=vhdx-fixed, data-copy and verify, Windows Direct attach
- vhdx-fixed single partition, ntfs3 formatted by diskmgmt
- Win10-21H2-1904.2075
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(vhdx-dyn)
Duration: copy=16 min, verify=12 min
Result: success

Test-10 :

Description:
- dst=vhdx-fixed, data-copy and verify, inside qemu-VM (no wnbd control-case)
- vhdx-fixed single partition, ntfs3 formatted inside qemu-VM
- Win10-21H2-1904.2075/whpx/qemu-7.1/qcow2-fixed/Fedora-36/kernel-5.18.17-200
- commandline rsync: src=attached-vhdx/windows-formatted-ntfs/ntfs3 dst=vhdx-fixed/single-partition/ntfs3
Duration: copy=? min, verify=? min
Result: ? https://gitlab.com/qemu-project/qemu/-/issues/1236 https://gitlab.com/qemu-project/qemu/-/issues/806

Test-11 :

Description:
- dst=vhdx-fixed, data-copy and verify, using qemu-storage-daemon, wnbd driver and XCOPY
- qcow2-dyn, intialized and ntfs formatted by diskmgmt
- Win10-21H2-1904.2075/qemu-storage-daemon-7.1/rawimg/wnbd-0.2.2-11-g3dbec5e
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(wnbd/qemu-storage-daemon/vhdx-fixed/single-partition/ntfs3)
Duration: copy=? min, verify=? min
Result: ? https://gitlab.com/qemu-project/qemu/-/issues/1236 https://gitlab.com/qemu-project/qemu/-/issues/806

Test-12 :

Description:
- dst=vhdx-dyn, data-copy and verify, Windows Direct attach
- vhdx-dyn single partition, ntfs3 formatted by diskmgmt
- Win10-21H2-1904.2075
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(vhdx-dyn)
Duration: copy=? min, verify=? min
Result: ?

Test-13 :

Description:
- dst=vhdx-dyn, data-copy and verify, inside qemu-VM (no wnbd control-case)
- vhdx-dyn single partition, ntfs3 formatted inside qemu-VM
- Win10-21H2-1904.2075/whpx/qemu-7.1/qcow2-fixed/Fedora-36/kernel-5.18.17-200
- commandline rsync: src=attached-vhdx/windows-formatted-ntfs/ntfs3 dst=vhdx-dyn/single-partition/ntfs3
Duration: copy=? min, verify=? min
Result: ? https://gitlab.com/qemu-project/qemu/-/issues/727

Test-14 :

Description:
- dst=vhdx-dyn, data-copy and verify, using qemu-storage-daemon, wnbd driver and XCOPY
- vhdx-dyn, intialized and ntfs formatted by diskmgmt
- Win10-21H2-1904.2075/qemu-storage-daemon-7.1/rawimg/wnbd-0.2.2-11-g3dbec5e
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(wnbd/qemu-storage-daemon/vhdx-dyn/single-partition/ntfs3)
Duration: copy=? min, verify=? min
Result: ? https://gitlab.com/qemu-project/qemu/-/issues/727

Test-15 :

Description:
- dst=rawfsimg-fixed, data-copy and verify, Imdisk attach
- rawfsimg-fixed single partition, ntfs3 formatted by diskmgmt
- Win10-21H2-1904.2075
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(rawfsimg-fixed)
Duration: copy=18 min, verify=12 min
Result: success

Test-16 :

Description:
- dst=vhdx-fixed, data-copy and verify, Imdisk attach
- vhdx-fixed single partition, ntfs3 formatted by diskmgmt
- Win10-21H2-1904.2075
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(vhdx-fixed)
Duration: copy=18 min, verify=12 min
Result: success

Test-17 :

Description:
- dst=vhdx-dyn, data-copy and verify, Imdisk attach
- vhdx-dyn single partition, ntfs3 formatted by diskmgmt
- Win10-21H2-1904.2075
- XCOPY command-line copy: src=E:\test\sgdata dst=K:(vhdx-dyn)
Duration: copy=? min, verify=? min
Result: ?

Notes

qcow2-fixed had to be created under a Linux-Host
imdisk uses discutils which does not support qcow2, only rawfsimg, dmg, vdi, vhd, vhdx, vmdk, Xva

Commands involved

Creating a disk image in Linux

raw

dd if=/dev/zero of=/mnt/a16/gkpics01.raw.img bs=1048576 count=95104 status=progress

qcow2-fixed

qemu-img create -f qcow2 H:\gkpics01.qcow2 -o preallocation=full 99723771904

qcow2-dyn

qemu-img create -f qcow2 /mnt/a16/gkpics01.qcow2 99723771904

vhdx-fixed

qemu-img create -f vhdx -o subformat=fixed,,block_state_zero=off /mnt/a16/gkpics01.vhdx 99723771904

vhdx-dyn

qemu-img create -f vhdx -o subformat=dynamic /mnt/a16/gkpics01.vhdx 99723771904

Creating a disk image in Windows

raw

"C:\vol\prog_01\prog\windd_chrysocome\ddrelease64.exe" bs=1048576 count=95104 if=/dev/zero of=H:/gkpics01.raw.img --progress

qcow2-fixed ( Not Do-able as of now https://gitlab.com/qemu-project/qemu/-/issues/813 )

qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 -o preallocation=full 99723771904

qcow2-dyn

qemu-img create -f qcow2 /mnt/a16/gkpics01.qcow2 99723771904

vhdx-fixed

New-VHD -Path I:\gkpics01.vhdx -SizeBytes 99723771904 -Fixed
or
qemu-img create -f vhdx -o subformat=fixed,block_state_zero=off F:/gkpics01.vhdx 99723771904
fsutil sparse setFlag F:\gkpics01.vhdx 0

vhdx-dyn

New-VHD -Path I:\gkpics01.vhdx -SizeBytes 99723771904 -Dynamic
or
qemu-img create -f vhdx -o subformat=dynamic F:/gkpics01.vhdx 99723771904
fsutil sparse setFlag F:\gkpics01.vhdx 0

Blanking a block-device in Linux / Linux-VM

[root@fedora mnt]# ( sdate=`date` ; echo "$sdate" ; dd if=/dev/zero of=/dev/vdc bs=$((512*204)) status=progress ; echo "$sdate" ; date )
Mon Sep 26 09:08:25 AM IST 2022
99677337600 bytes (100 GB, 93 GiB) copied, 1919 s, 51.9 MB/s
dd: error writing '/dev/vdc': No space left on device
954770+0 records in
954769+0 records out
99723771904 bytes (100 GB, 93 GiB) copied, 1920.97 s, 51.9 MB/s
Mon Sep 26 09:08:25 AM IST 2022
Mon Sep 26 09:40:26 AM IST 2022

Make ntfs fs under Linux-VM

[root@fedora gana]# mkfs -t ntfs -Q -L fs_gkpics01 /dev/vdc1

Run an nbd-server in windows

qemu-nbd as nbd-server

E:\vstorage\nbd>qemu-nbd.exe H:\gkpics01.raw.img -x gkpics01
WARNING: Image format was not specified for 'H:\gkpics01.raw.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.

qemu-storage-daemon as nbd-server

direct raw
E:\vstorage\nbd> qemu-storage-daemon --blockdev driver=file,node-name=disk,filename=H:\gkpics01.raw.img --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=disk,name=gkpics01,writable=on

With a vdisk-format
qemu-storage-daemon --blockdev driver=file,node-name=file,filename=H:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on

Starting a qemu-VM in Windows

qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot c -drive "index=0,if=virtio,media=disk,format=raw,file=H:\m01_bootnoefi.raw.img" -drive "index=1,if=virtio,media=disk,format=raw,file=H:\m01_lnx.raw.img" -drive "index=2,if=virtio,media=disk,format=qcow2,file=H:\gkpics01.qcow2"  -drive "index=3,if=virtio,media=disk,format=vhdx,file=E:\test\sgdata.vhdx" -display gtk -vga virtio -rtc base=utc -netdev user,id=vmnic1,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15,hostfwd=tcp::9551-:22 -device virtio-net,netdev=vmnic1 -chardev qemu-vdagent,id=ch1,name=vdagent,clipboard=on -device virtio-serial -device virtserialport,chardev=ch1,id=ch1,name=com.redhat.spice.0 -qmp "tcp:127.0.0.1:5955,server,nowait"

Attach nbd drive in windows using wnbd

PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01

Mount/unmount in Linux-VM

[root@fedora gana]# mount -t ntfs3 /dev/vdd2 /mnt/b
[root@fedora gana]# mount -t ntfs3 /dev/vdc1 /mnt/a
[root@fedora gana]# umount /mnt/a
[root@fedora gana]# umount /mnt/b

The ntfs3 fs-type argument is important, as Fedora 36 Linux, by default uses fuseblk driver

[root@fedora gana]# cat /etc/mtab | grep vdc1
without -t ntfs3 fstype-argument
/dev/vdc1 /mnt/a fuseblk rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0
vs with -t ntfs3 fstype argument
/dev/vdc1 /mnt/a ntfs3 rw,relatime,uid=0,gid=0,iocharset=utf8 0 0

Add/Remove drive letter in Windows before unmapping

Get-Volume -FriendlyName fs_gkpics01 | Get-Partition | Add-PartitionAccessPath -accesspath "K:"
Get-Volume -Drive "K" | Get-Partition | Remove-PartitionAccessPath -accesspath "K:"
Set-Disk -Number $diskNumber -IsOffline $true

Do copy/write in Linux-VM

[root@fedora gana]# ( fl="photos001" ; src="/mnt/b" ; dst="/mnt/a" ; sdate=`date` ; cd "$src" ; rsync -avH "$fl" "$dst"  ; echo "$sdate" ; date ; sudo -u gana DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus DISPLAY=:0.0 -- notify-send "$src/$fl" "rsync $src/$fl" )

Do copy/write in windows

Use windows explorer gui in order to copy paste data from src folder to dst folder Using xcopy:

ps> pwsh -Command { $sdate=$(date) ; echo "$sdate" ; xcopy /E E:\test\sgdata K:\ ; echo "$sdate" ; echo "$(date)" }
: 
15377 File(s) copied
09/25/2022 15:24:13
09/25/2022 15:45:21

Sync command in windows, from sysinternal command line utilities

PS C:\> sync K:
Sync v2.2 - Flush cached data to disk.
Copyright (C) 2016 Mark Russinovich
Sysinternals - www.sysinternals.com
Flushing: K

Sync command on Linux-VM

[root@fedora gana]# sync

Unmap/disconnect/detach drive in windows

PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe unmap gkpics01
libwnbd.dll!WnbdRemoveEx WARNING Soft device removal failed. Falling back to hard removal. This happens even if I do Drive-letter-removal in diskmgmt

unmapping drive takes about 30 seconds before anything happens. Note the error message.

Verify copied data in Linux-VM

[root@fedora gana]# # verification under linux-VM
[root@fedora gana]# mount /dev/vdc1 /mnt/a
[root@fedora gana]# ( fl="photos001" ; rtpt="/mnt/a" ; cd "${rtpt}/${fl}" ; sdate=`date` ; echo "$sdate" ; sha256sum -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; date ; sudo -u gana DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus DISPLAY=:0.0 -- notify-send "$src/$fl" "checksum $src/$fl" )
Sun Sep 25 11:51:34 AM IST 2022
Sun Sep 25 11:51:34 AM IST 2022
Sun Sep 25 12:13:53 PM IST 2022

Verify copied data in windows

PS C:\Users\gana> powershell -Command { $fl="photos001" ; $rtpt="K:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)"  }
C:\Users\gana\OneDrive\Documents\WindowsPowerShell
09/25/2022 15:55:43
09/25/2022 15:55:43
09/25/2022 16:07:55

The error message during Unmap/disconnect/detach drive in windows happens even if I manually remove-drive-letter in diskmgmt-gui before unmapping

Q1) how can one unmount the drive without any worry causing error

Tests [01,02,03,04] were done using fixed-size raw image. I will do more tests. Try and see what happens with expanding virtual-disks, but as you say, that puts wnbd partly off the hook. However, the wnbd driver should still handle it without lockup-disaster even if the nbd server is slow to catch up.

petrutlucian94 commented 2 years ago

Just wondering: have you attached the same image to a windows host and linux vm simultaneously? That's likely going to corrupt the ntfs filesystem. In general, only a cluster aware filesystem can be safely attached to multiple hosts without risking data corruption (e.g. Windows Cluster Shared Volume, but that requires persistent SCSI reservations, which we don't support yet).

Q1) how can one unmount the drive without any worry causing error

Wnbd tries to do a "soft" unmap first, which notifies the Windows storage stack the the disk is going to be removed, letting other drivers flush their data before proceeding with the unmap. By default, we'll wait 30s before doing a hard remove as a fallback. If --debug is specified, wnbd-client.exe will specify which driver is preventing the disk from being unmapped.

Here are the parameters that unmap can receive:

Usage: wnbd-client unmap <instance-name>
                         [--hard-disconnect] [--no-hard-disconnect-fallback]
                         [--soft-disconnect-timeout <arg>]
                         [--soft-disconnect-retry-interval <arg>]

Remove disk mapping.

Positional arguments:
  <instance-name>                          Disk identifier.

Optional arguments:
  [--hard-disconnect]                      Perform a hard disconnect. Warning:
                                           pending IO operations as well as
                                           unflushed data will be discarded.
  [--no-hard-disconnect-fallback]          Immediately return an error if the
                                           soft disconnect fails instead of
                                           attempting a hard disconnect as
                                           fallback.
  [--soft-disconnect-timeout <arg>]        Soft disconnect timeout in seconds.
                                           The soft disconnect operation uses
                                           PnP to notify the Windows storage
                                           stack that the device is going to be
                                           disconnected. Storage drivers can
                                           block this operation if there are
                                           pending operations, unflushed caches
                                           or open handles. Default: 30.
  [--soft-disconnect-retry-interval <arg>] Soft disconnect retry interval in
                                           seconds. Default: 2.

Common arguments:
  [--debug]                                Enable debug logging.

hgkamath commented 2 years ago

Nope, I am certain to make sure the raw-image isn't simultaneously used by nbd-client and VM. By default, qemu-nbd is set to disconnect when a client disconnects. (There another option to let it remain online for a re-connection. -t, --persistent don't exit on the last connection, but I haven't done that).

I make sure to detach drive/image from windows, before starting qemu-VM.
And before returning to Windows-wnbd testing, I shutdown the qemu-VM completely (systemctl poweroff), make sure qemu-system-x86_64 has given back control to cmd-prompt before I start qemu-nbd preparing for windows attachment.

Yes, I know it would be very bad if two non-co-operating processes use and write-to the image simultaenously, not my intent.

There could be a bug there if soft-disconnect is not happening. I don't need to do any data transfer for that to happen. Just

start qemu-nbd,
map drive using wnbd
- Check diskmgmt.msc to make sure a drive letter is allocated
Unmap drive using wnbd
- Doesn't matter if drive-letter is removed inside diskmgmt. same error either case

In my tests, there is no data error, on account of hard-disconnect. I ensure to use sync command before unmapping. But then, as I said, it is a little worry.

In the end goal, I envision to have, a 2Tb disk with many 100Gb vdisks on it, to be attached on a need basis.

some vhdx vdisks to be used windows only. diskmgmt can attach those directly. Ex scoop, windows-progs, (Hopefully I replace all with qcow2 as in 2.
some qcow2 vdisks may provide data-vdisks to windows on a need basis, using wnbd.
most qcow2 vdisks are be attached to VM, mounted and used within the VM, and then perhaps just use samba/CIFS if it is required to share any data folder to the host-OS

Ideally, I wanna do away with different formats, and go with the ONE TRUE VDISK FORMAT that works well with Windows-OS or Linux-OS, be it hyperv or kvm/qemu, be it attached to HOST-OS or Guest-VM-OS

If I go with vhdx, qemu-implementation seems to have corruption problems
If I go with qcow2, there is the secureboot issue, apart from assuring myself that qemu-nbd/wnbd is safe/reliable.
if I go with vmdk, qemu vmdk implementation not performant, plus on windows only ImDIsk-Toolkit is only way to mount vmdk (free-maybe-opensource-tool).

I hope to make my system as follows.

Dual Boot
- Windows-Host
- hyperv/qemu running a Linux-VM
- Linux-Host
- kvm/qemu running same Linux-VM

In the comment before last, you mentioned to get rid of qemu. Did you have another open-source server in mind for me to try on the windows host. I thought ceph-server needs to run on a linux-host. Also, the whole cluster-FS is too much overkill for a single user single machine system.

petrutlucian94 commented 2 years ago

There could be a bug there if soft-disconnect is not happening.

Well, if there's another driver from the storage stack that's preventing us from doing a soft disconnect, there's not much we can do. We can either return an error or do a force unmap, which can be configured using the unmap parameters. Again, it might be useful to know who's holding the disk (using the --debug parameter).

In the comment before last, you mentioned to get rid of qemu. Did you have another open-source server in mind for me to try on the windows host. I thought ceph-server needs to run on a linux-host. Also, the whole cluster-FS is too much overkill for a single user single machine system.

indeed, ceph was definitely not intended for single host environments. if qemu-nbd causes problems, you may also consider lvm + iscsi or even zfs + iscsi.

hgkamath commented 2 years ago

A dummy map and unmap cycle: As I had removed drive letter in diskmgmt in a previous attempt, when wnbd-map command attaches the drive, diskmgmt remembers not to assign a drive letter. So on this trial run, there is no need to even do sync or remove-drive-letter. Nevertheless, unmap soft removal fails. All commands are executed under a minute.

PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01 --debug
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}

PS C:\> wnbd-client.exe list
Pid         DiskNumber  Nbd    Owner            InstanceName
2936        10          true   wnbd-client      gkpics01

PS C:\> wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 104
TotalSubmittedIORequests       : 104
TotalReceivedIOReplies         : 104
UnsubmittedIORequests          : 0
PendingSubmittedIORequests     : 0
AbortedSubmittedIORequests     : 0
AbortedUnsubmittedIORequests   : 0
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 0

PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe unmap --instance-name gkpics01 --debug
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 5.82s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 13.63s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 21.45s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 29.31s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 37.09s
libwnbd.dll!WnbdRemoveEx WARNING Soft device removal failed. Falling back to hard removal.

petrutlucian94 commented 2 years ago

That's useful information, the veto type is set to 5 (PNP_VetoOutstandingOpen) [1], so there's probably an open handle.

I usually use Process Hacker or the sysinternals tools to identify open handles. You could also try setting the disk offline before unmapping it:

Set-Disk -Number $diskNumber -IsOffline $true

[1] https://learn.microsoft.com/en-us/windows/win32/api/cfg/ne-cfg-pnp_veto_type#constants

hgkamath commented 2 years ago

I determined what the cause is.

I usually keep a admin privilege taskmgr.exe process running for emergency troubleshooting. If taskmgr is closed, soft terminations happens. Why does taskmgr do that? It probably shouldn't

The EventManager-LogViewer eventvwr.msc had the following info

map:
warn 4:37:15 NbdNegotiate:296 Ignoring unsupported NBD reply info type: 3
Info 4:37:15 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.

unmap:
Warn 4:37:36 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:39 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.
Warn 4:37:45 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:45 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.
Warn 4:37:53 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:55 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.
Warn 4:38:01 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:38:01 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.
Warn 4:38:08 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Warn 4:38:08 NbdReadExact:36 Failed with : 0
Err  4:38:38 NbdReadReply:490 Could not read command reply.
Warn 4:38:38 Disk 10 has been surprise removed.

Same thing happens, even if volume/disk is set to offline using the Set-Disk command. diskmgmt shows disk having gone offline.

Apparently, that's an M$ thing

https://superuser.com/questions/1338161/why-is-task-manager-using-my-usb-drive

petrutlucian94 commented 2 years ago

If taskmgr is closed, soft terminations happens. Why does taskmgr do that? It probably shouldn't

taskmgr provides various disk IO statistics, so I assume it holds some disk handles. That's normal behavior, nice catch though.

Warn 4:38:08 NbdReadExact:36 Failed with : 0 Err 4:38:38 NbdReadReply:490 Could not read command reply.

That message gets logged when an nbd connection is closed. I've added a TODO item to avoid logging an error since that can be confusing.

hgkamath commented 1 year ago

20230414 wnbd/qsd: Documenting a stuck on xcopy situation for reproduction

It is random in the sense it seemes to happen once every two or three runs.
It does not happen on windows through win-qemu-VM with attached qcow2-dyn/qemu-fixed disk
It does not happen on Linux kernel-6.2.10 with lnx-qemu-qsd-8.0.0-rc3 with attached qcow2-dyn/qemu-fixed disk

PS C:\Windows\system32>  D:\vstorage\nbd\wnbd_client\wnbd-client.exe version
wnbd-client.exe: 0.4.1-10-g5c5239c
libwnbd.dll: 0.4.1-10-g5c5239c
wnbd.sys: 0.4.1-10-g5c5239c

Copy test

Windows-10-22H2 19045.2846
qemu-storage-daemon-8.0.0-rc3 qemu-w64-setup-20230411.exe https://qemu.weilnetz.de/w64/
wnbd 0.4.1-10-g5c5239c
src=hdd/winntfs (synthetic gen data as described in https://gitlab.com/qemu-project/qemu/-/issues/727#note_739930694)
dst=ssd/winntfs/qcow2-fixed(qemu-img)/qsd/wnbd/winntfs

qemu-storage-command D:\tmpq\qemu\8.0.0-rc3\qemu\qemu-storage-daemon.exe --blockdev driver=file,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on
F:\gkpics01.qcow2 is qcow2-fixed 93GiB (prepare this qcow2-fixed vdisk file ahead of time in a linux machine, windows qemu-img as of today cannot create qcow2-fixed)
wnbd map : D:\vstorage\nbd\wnbd_client\wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01
In diskmgmt, initialize gpt and make 1 ntfs volume, quick formatted with label fs_gkpics01
Perform data copy test with xcopy command pwsh -Command { $sdate=$(date) ; echo "$sdate" ; xcopy /E D:\test\sgdata\photos001 J:\photos001 ; echo "$sdate" ; echo "$(date)" } . The OS becomes stuck about 5 minutes into the roughly 15 minute operation. In sysinternals-procexp app, disk-usage-graph goes down from MBps to Bps.
```
:
D:\test\sgdata\photos001\D0014_127c4d62\File_0000.bin
... OS/xcopy is stuck here, below is after unstucking
File creation error - The system cannot find the path specified.

04/14/2023 17:07:12
04/14/2023 17:12:59
```

System stuck happens when the rsync process reaches a particular file, which is a huge >4GiB file. I want to weaken this statement and say its more likely when doing that file, but in case it doesn't, then xcopy progresses along, but stuck is likely to happen on another large file.

PS C:\Windows\system32> dir D:\test\sgdata\photos001\D0014_127c4d62\
   Directory: D:\test\sgdata\photos001\D0014_127c4d62
Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----        11/22/2021   3:13 PM     4294694852 File_0000.bin

To unstuck the situation: In a cmd window, run wnbd-client uninstall-driver. The qemu-storage-daemon which was silent till now, will be disrupted by the force removed disk will go (about 20 times)
```
D:\tmpq\qemu\8.0.0-rc3\qemu\qemu-storage-daemon.exe: Disconnect client, due to: Failed to send reply: Unable to write to socket: Input/output error
D:\tmpq\qemu\8.0.0-rc3\qemu\qemu-storage-daemon.exe: Disconnect client, due to: Failed to send reply: Unable to write to socket: Input/output error
:
```
The qemu-storage-daemon, however, still is in running condition, and will remain so, unless one presses Ctrl-C.
The xcopy will say File creation error - The system cannot find the path specified.
As the qemu-storage-deamon is still running , it is possible to reinstall wnbd-driver, map-volume, format-volume and restart xcopy from the beginning, This sort of indicates that the qemu-storage-deamon nbd-server is responsive.

Event log

In the below, who is \Device\RaidPort4? that raises event ID-129.
I don't even use raid.
Laptop has 3 disks sata-ssd, sata-hdd and sata-hdd via a msata-cdrom-caddy.
I later gather this raidport# number is whatever wnbd attaches itself as on the underside to storport.
Looking into the eventlog filter for event-id-129, source is usually wnbd.
Maybe, the number of times it shows up corresponds to number of times, I've let system get stuck by this test.
In the past, wnbd has raised event-id-129 as raidport1, raidport6, raidport10, so it depends on what it gets assigned during driver insertion.
The binary data payload seems to be the same for each event-id-129.

The below log is from a different run.

D:\test\sgdata\photos001\D0020_bf23a7e9\File_0452.bin
File creation error - A device which does not exist was specified.

04/19/2023 13:57:35
04/19/2023 14:07:43

I looked at eventvwr.msc Windows-logs/System category.
This is the first event after a event-lul period , about 9 minutes into the xcopy step.

Log Name:      System
Source:        wnbd
Date:          4/19/2023 2:06:34 PM
Event ID:      129
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      Capella-PC
Description:
The description for Event ID 129 from source wnbd cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

\Device\RaidPort4

The message resource is present but the message was not found in the message table

Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="wnbd" />
    <EventID Qualifiers="32772">129</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2023-04-19T08:36:34.8522246Z" />
    <EventRecordID>277544</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="23460" />
    <Channel>System</Channel>
    <Computer>Capella-PC</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\RaidPort4</Data>
    <Binary>0F001800010000000000000081000480010000000000000000000000000000000000000000000000000000000000000000000000810004800000000000000000</Binary>
  </EventData>
</Event>

After this, there are many disk io errors, like so: There are 4 of these at 2:06:34

Log Name:      System
Source:        disk
Date:          4/19/2023 2:06:34 PM
Event ID:      153
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      Capella-PC
Description:
The IO operation at logical block address 0x5f3038 for Disk 11 (PDO name: \Device\000000c0) was retried.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="disk" />
    <EventID Qualifiers="32772">153</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2023-04-19T08:36:34.8522246Z" />
    <EventRecordID>277545</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="23460" />
    <Channel>System</Channel>
    <Computer>Capella-PC</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\Harddisk11\DR34</Data>
    <Data>0x5f3038</Data>
    <Data>11</Data>
    <Data>\Device\000000c0</Data>
    <Binary>0F01040004002C0000000000990004800000000000000000000000000000000000000000000000000000022A</Binary>
  </EventData>
</Event>

This is after the reset adapter

Log Name:      System
Source:        disk
Date:          4/19/2023 2:07:39 PM
Event ID:      157
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      Capella-PC
Description:
Disk 11 has been surprise removed.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="disk" />
    <EventID Qualifiers="32772">157</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2023-04-19T08:37:39.6266117Z" />
    <EventRecordID>277550</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="24996" />
    <Channel>System</Channel>
    <Computer>Capella-PC</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\Harddisk11\DR34</Data>
    <Data>11</Data>
    <Binary>0000000002003000000000009D000480000000000000000000000000000000000000000000000000</Binary>
  </EventData>
</Event>

Events that come after the previous: There are many ntfs-delayed-write-failed and application-popup-delayed-write-failed events, which are to be expected after disk removal.

hgkamath commented 1 year ago

Q) does the above eventvwr wnbd event-id-129 and its binary payload provide any debugging clue?

As in case of Azure link below, the wnbd interface to the network block device is an emulated device, and there no real physical scsi hardware for failure to happen. Perhaps if some nbd-communication times-out, it assumes a 129. Perhaps there is some file system maintenance for file-expansion or something that kicks in to cause the delay. Note: mine is a haswell quadcore machine, nbd-server and storport are colocated, and might not be as fast as machines you use.

Apparently, HyperV also had similar resets, which needed to be handled, as they were harmless emulated timing issues. https://support.microsoft.com/en-us/topic/many-id-129-events-are-logged-when-you-run-a-windows-server-2012-r2-virtual-machine-in-windows-azure-a9c9d4eb-7533-2704-6049-7428bbb565cb

qemu-storage-daemon, does not do any disk virtualization/physical-disk-emulation/bus-protocol-emulation, and its file/virtual-disk/block code processes data back-and-forth direct to nbd-export. Unlike qemu-nbd, which by default halts on client-connection-drop and requires a persistent command-line-argument to tell not to terminate, the qemu-storage-daemon process runs (persistent by default) until one does ctrl-C on it. So it should be possible to reconnect.

petrutlucian94 commented 1 year ago

It just means that the nbd connection hanged and storport is issuing LUN resets.

I'll need to:

reproduce this on my env to see why NBD connections hang
reconnect to the NBD server when resetting the LUN

petrutlucian94 commented 1 year ago

I've replicated the issue using qemu-storage-daemon.exe. The NBD connection eventually hangs and we can no longer send requests or receive replies. Windows becomes extremely unresponsive when that happens.

This doesn't affect rbd-wnbd, so the potential causes are:

nbd client issue
- nbd protocol implementation
- WSK or libksocket bug
nbd server bug

Thanks again for the bug report, I'm reopening the issue. Sorry for the slow progress, our main focus is currently Ceph, we're providing the NBD client functionality as a best effort.

LE: NBD socket reads and writes hang indefinitely: https://pastebin.com/raw/6rrggyDs. We'll move the NBD client to userspace as soon as possible.

LLE: there seems to be a problem with qemu-storage-daemon on Windows. Wnbd works fine with Linux qemu-storage-daemon NBD servers or the standard Linux nbd-server.

I did prepare a draft that moves the NBD implementation to userspace, making it easier to maintain: https://github.com/cloudbase/wnbd/pull/122. I'll probably prepare another PR that forwards LUN resets to the wnbd daemon so that it may reset or remove the connection instead of blocking the host.

Looks like qemu's blk_co_pwrite hangs at some point: https://github.com/qemu/qemu/blob/6972ef1440a9d685482d78672620a7482f2bd09a/nbd/server.c#L2514. This might also affect Qemu VMs on Windows.

petrutlucian94 commented 1 year ago

@hgkamath I spent some time looking at qemu-storage-daemon, some WriteFile calls were hanging. The good news is that it seems to work fine when passing cache.direct=on, for example:

qemu-storage-daemon.exe  --blockdev driver=file,node-name=file,filename=e:\img\test_100gb.raw,cache.direct=on --blockdev driver=raw,node-name=raw,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=raw,name=test_100gb,writable=on

hgkamath commented 1 year ago

I'll try that out.

One thought, that I wanted to pass along, was that
I discovered qemu-win also uses mingw-winpthreads by default.
There is a bug, which I filed myself, https://gitlab.com/qemu-project/qemu/-/issues/1605
in which I found that the workaround is to use native threads with aio=native instead of the default mingw-winpthreads, the implicit option being aio=threads

I have encountered that particular qemu-bug only with rapid expansion of VHDX-dynamic vdisk-file (vhdx-dyn). The chance that it surfaces here is slim because, this test uses qcow2-fixed. a) for starters, qemu-s implementation of its codenative format qcow2 is much more performant than VHDX b) being a fixed vdisk, the block-allocation is complete and so this bug's vdisk file does not have to expand.

Also, your comment, also confirms it happens for raw format itself. Indeed, one should test in raw. This bug is independent of the qemu's vdisk format, and it is best to choose raw, to remove uncertainties arising due to qemu's vdisk implementations.

Nevertheless, I now test with a drive option aio=native just to reduce the chance of that. PS C:\Users\admin> qemu-storage-daemon.exe --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on

PS C:\Windows\system32> $(dir F:\gkpics01.qcow2).Length
99739238400
PS C:\Windows\system32> fsutil sparse queryflag F:\gkpics01.qcow2
This file is NOT set as sparse

Expectedly, I still did experience a hang with aio=native also.

In one of the hangs, it was possible to do an uninstall-driver, install-driver, map and resume/restart the xcopy without having to Ctrl-C the qemu-storage-daemon, meaning, that in that run, the qemu-storage-daemon nbd-server was not stuck and nbd reconnection succeeded.

From the qemu manual pdf pg41 :
cache.direct : The host page cache can be avoided with cache.direct=on. This will attempt to do disk IO directly to the guest’s memory. QEMU may still perform an internal copy of the data.
For cache.direct to work on windows. the file-open-flag O_DIRECT should should be supported on windows/mingw/win32 api, also the filesystem and filesystem-driver should support it.
win32 file has FILE_FLAG_NO_BUFFERING for CreateFile, ReOpenFile, file-buffering
FILE_FLAG_NO_BUFFERING maps to the O_DIRECT flag for the Linux open command.
So, it does. block/file-win32.c does use the flag.
In file-win32.c/file-posix.c, the function raw_parse_flags() sets the corresponding flags based on if BDRV_O_NOCACHE is set, which is set if cache.direct option is provided.
... just checking.

Its a good idea to quickly detect stuck situation, and react by uninstalling driver.
Superstition: I think, leaving machine unattended and risking it remaining in this stuck-state can lead to machine becoming slower and eventually having a more serious system-crash.
Techniques I use to quickly detect a stuck-situation

Usually, for small files, one can see xcopy progressing. That clearly means system not stuck.
For larger files, xcopy takes a long time on the file. So, if one keeps a notepad open and alternatively press-space-bar and ctrl-S (save) When machine is not stuck, the unsaved asterix indicator in the titlebar will toggle. When machine becomes suck, notepad becomes unresponsive on save.
IDEA-FAIL: Run the below script in powershell. It should not block on file-access, and keep printing the time every 6 seconds. Not that I expect the file-size to change.
PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1}" -f $(date),$(dir F:\gkpics01.qcow2).Length ; sleep 6 ; $i = $i + 1 }
The above turned out to be ineffective. So I tried the below
PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1} {2}" -f $(date),$(dir F:\gkpics01.qcow2).Length,$(dir F:\gkpics01.qcow2).LastWriteTime[0].ToString() ; "test" | Out-File -FilePath K:\tmpq\Downloads\a.txt ; sleep 6 ; $i = $i + 1 } But with this, I could not reproduce the stuck-state, it was acting like a keep-alive.

Another superstition that I am still watching out for is that I think if stuck happens its most likely at the really large files. Its not always the same file, but I will need to see if it is always one of the below large ones.
EDIT: This is maybe just that, a superstition, I started noting the ending file's file-size, and neither its, or the next's was always large, sometimes files were as small as 3Mib.
>512 MiB

D:\test\sgdata>forfiles /S /M * /C "cmd /c if @fsize GEQ 536870912 echo @path @fsize"
"D:\test\sgdata\photos001\D0011_325a7df4\D0004_ff4697dc\File_0000.bin" 870548550
"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0000.bin" 917280712
"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0001.bin" 1193905888
"D:\test\sgdata\photos001\D0014_127c4d62\File_0000.bin" 4294694852
"D:\test\sgdata\photos001\D0015_a42d4363\File_0000.bin" 2381007674
"D:\test\sgdata\photos001\D0015_a42d4363\File_0001.bin" 2103925866
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0012.bin" 1223441112
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0014.bin" 1498293582
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0019.bin" 800657438
"D:\test\sgdata\photos001\D0017_6986e32f\File_0000.bin" 1889579172
"D:\test\sgdata\photos001\D0017_6986e32f\File_0001.bin" 1978039382
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1824.bin" 885577878
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1849.bin" 758393712
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1861.bin" 622607906
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1867.bin" 559405836
"D:\test\sgdata\photos001\D0021_1ab60b0c\File_1027.bin" 790637902
"D:\test\sgdata\photos001\D0021_1ab60b0c\File_1030.bin" 859589186
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0030.bin" 586790162
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0032.bin" 710406956
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0034.bin" 632330336
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0036.bin" 662133078
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0040.bin" 1157315322
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0042.bin" 1553586422
"D:\test\sgdata\photos001\D0025_7bd77bae\File_0110.bin" 785962104
"D:\test\sgdata\photos001\D0025_7bd77bae\File_0112.bin" 689221632
"D:\test\sgdata\photos001\D0026_409200f8\File_0034.bin" 797661226
"D:\test\sgdata\photos001\D0026_409200f8\File_0036.bin" 1151132800
"D:\test\sgdata\photos001\D0030_2a29839c\File_0516.bin" 835986114
"D:\test\sgdata\photos001\D0031_b20f40a7\File_0045.bin" 828452346

>1 GiB

D:\test\sgdata>forfiles /S /M * /C "cmd /c if @fsize GEQ 1073741824 echo @path @fsize"

"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0001.bin" 1193905888
"D:\test\sgdata\photos001\D0014_127c4d62\File_0000.bin" 4294694852
"D:\test\sgdata\photos001\D0015_a42d4363\File_0000.bin" 2381007674
"D:\test\sgdata\photos001\D0015_a42d4363\File_0001.bin" 2103925866
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0012.bin" 1223441112
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0014.bin" 1498293582
"D:\test\sgdata\photos001\D0017_6986e32f\File_0000.bin" 1889579172
"D:\test\sgdata\photos001\D0017_6986e32f\File_0001.bin" 1978039382
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0040.bin" 1157315322
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0042.bin" 1553586422
"D:\test\sgdata\photos001\D0026_409200f8\File_0036.bin" 1151132800

hgkamath commented 1 year ago

Confirming that the workaround cache.direct=on gave me a successful run.
Will do future tests both with and without it.

I theorize that be removing the OS/host-caching from the picture, probably increases the responsivity (time to respond to wnbd-client), and hence makes it less likely for the protocol to trip and lead into stuck-state.
Probably WinOS sends processes to sleep state during the time of data going into cache and out.
The question becomes: is it winOS tripping on caching, qemu tripping on caching, or wnbd client tripping due to delay.

PS C:\Windows\system32> qemu-storage-daemon.exe  --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2,cache.direct=on --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on

PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)"  }
:
15377 File(s) copied
05/18/2023 17:33:06
05/18/2023 17:50:19

# unmap remap
PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe unmap gkpics01
libwnbd.dll!PnpRemoveDevice WARNING Could not remove device. CM status: 23, veto type 6, veto name: STORAGE\Volume\{50a6c206-f464-11ed-9807-3464a9d34f26}#0000000001000000. Time elapsed: 1.87s, time left: 28s. Check the Windows System event log (Kernel-PnP source) for more information about the process that is holding the disk.
PS C:\Windows\system32>  D:\vstorage\nbd\wnbd_client\wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01
PS C:\Windows\system32>

PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)"  }
05/18/2023 17:50:49
05/18/2023 17:50:49
05/18/2023 18:06:20

petrutlucian94 commented 1 year ago

Thanks for the confirmation. It doesn't seem related to NBD or the wnbd client. Qemu's WriteFile calls [1] are hanging unless we disable caching. I didn't dig too much into qemu's caching related code so I'm not sure if apart from FILE_FLAG_NO_BUFFERING [2] there are any other changes when passing cache-direct=on.

I wouldn't completely exclude some Windows overlapped IO weirdness, we had some similar issues [3].

[1] https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L86 [2] https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L281 [3] https://github.com/cloudbase/wnbd/pull/30

hgkamath commented 1 year ago

It feels like we're getting somewhere. Much thanks to your effort.
Most of what I write here I probably just verbiage to what you wrote succinctly.
(also do note that I am a user not a developer, so terms I say might not be correct)

In the qemu-project, on the windows side of things, there are only maintainers and no developers. Windows bugs get fixed if it was a happy side-effect of something else they were doing, usually on linux, or when windows users who have done the troubleshooting, nail something down, point out what-exactly has to be done. There aren't that many advanced-level windows-users.

Just to remember: When drive-lockup/system-stuck happens, only open windows/applications respond, until they are forced to do some disk-access. If an application attempts to do something with the disk, it freezes and becomes unresponsive. Ex notepad, powershell command load, etc.

Either one of following is true: 1) the WriteFile() is just yet another symptom of the drive-lockup/system-stuck (the chain of events leading to drive-lockup being caused elsewhere) , which would be a possibility if, as you said, if cause could be pinned to something cache.direct=on did elsewhere, apart from setting FILE_FLAG_NO_BUFFERING, wherever-else BDRV_O_NOCACHE influences ... and the overlapped-io-weirdness 2) or the WriteFile(), in conjunction with other asynchronous read/write calls, due to some combination of faulty flags and structure-parameters, is causing the block/deadlock, and everything is stuck because of it 3) or the WriteFile() is actively part of a cylic-chain of components that are in deadlock (cyclic-deadlock)

If it is a cyclic deadlock

Then
- wnbd is stuck because qemu nbd is not responsive.
- win10 is stuck and causes drive-lockup because wnbd is stuck.
- qemu Writefile() is stuck because win10 is in drive-lockup.
Maybe, when one gives cache.direct=on the disabling of host-pagefile-caching/file-buffering,
- makes qemu do its write sort-of atomically, which lets it respond to wnbd
- the responsiveness prevents wnbd from making win10 go into drive lockup
- as win10 drive lockup does not happen qemu does not get stuck on write.
If host-pagefile-caching/file-buffering is allowed (default), perhaps
- qemu writefile() is in some half-and-half state of data being committed to disk. qemu is waiting on win10
- but wnbd is also waiting on qemu and causes win10 to go into drive-lockup
- and because win10 went into drive-lockup, the qemu-writefile() isn't going to complete.
When trying to recover from drive-lockup/system-stuck, when wnbd driver is uninstalled.
- J: is force unmounted, Probably a Write() call return in error, xcopy process bails-out/get terminated.
- Win10 is no longer waiting on wnbd, frees the drive-lockup
- In qemu-storage-daemon, the WriteFile() call returns, and so it goes back to being responsive

Maybe,

giving cache.direct=on necessary and one way to avoid the avoid cyclic-deadlock. It obviously works by cutting out win10-OS from the picture, removing one extra copy-in copy-out step. But why does it work? Could it be that under more rigorous OS workloads, deadlock may happen even without file-buffering.
or cache.direct=on is unnecessary.
When xcopy copies a file, assuming xcopy doesn't disable file-buffering in its read()/write() calls, there will be some drive host-pagefile-caching/file-buffering happening at the mounted storage-level (J:). File-buffering in the in-between layers stage (F: and qemu-nbd-block-device emulation) is redundant But even if host-pagefile-caching/file-buffering, though unnecessary, it is allowed to be enabled, one should not run into drive-lockup/system-stuck, there must be another way to to prevent the cyclic-deadlock.

xcopy has a /J option. I'll do some test-runs with that and default cache.direct=off
- Prediction: leaning towards maybe it won't matter.
- ref: learn.microsoft: link
- Result: 2 of 2 testruns succeeded (prediction refuted)
- new hypthesis: the two file-buffering caches might be vying for memory

PS C:\Windows\system32> qemu-storage-daemon.exe --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on

PS C:\Windows\system32> xcopy /?
:
 /J           Copies using unbuffered I/O. Recommended for very large files.
:

PS C:\Windows\system32> $sdate=$(date) ; echo "$sdate" ; xcopy /J /E D:\test\sgdata\photos001 J:\photos001 ; echo "$sdate" ; $edate=$(date) ; echo "$edate

15377 File(s) copied
05/19/2023 15:00:32
05/19/2023 15:18:43

PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)" }

05/19/2023 15:23:59
05/19/2023 15:23:59
05/19/2023 15:39:13

Unbuffered write

Seems to be reccommeded when the file being copied is large
- 20110426 How to determine when to use the xcopy with the /j parameter? : link
- quote: Nina Liu MSFT - Unbuffered I/O (or a raw file copy) is preferred when attempting to copy a large file from one location to another when we do not intend to access the source file after the copy is complete. This will avoid the file system cache overhead and prevent the file system cache from being effectively flushed by the large file data.
- In that light, it probably makes sense for VMs, emulated backend devices and the like to always use unbuffered write for its internal block-device emulation: For a qemu-VM, a guest OS will have its own page-cache and filesystem buffering. For WNBD, the windows filesystem layer will do it.

Its also possible that if the nbd-client part of wnbd were jettisoned into userspace, the the OS/kernel would not be insistent that wnbd should respond immediately or within short timebound, allowing for a slower or delayed response from the emulated drive..

Q) Another question is to learn from you. Requesting you to document what you did, and what tools you used. As seen in bug qemu-#1605 so far, I am able to compile my own qemu-8.0 with debuginfo symbols using mingw, and do some very basic troubleshooting using gdb like putting assert() /printf() statements and inspecting stack frame variables.
How did you troubleshoot this? How did you know qemu was stuck on writefile()

petrutlucian94 commented 1 year ago

That's a very interesting observation about the cache dependency. I'd be curious to see what happens when connecting to qemu-storage-daemon from another host/vm (either Linux or Windows), thus avoiding this dependency.

So the idea is that WNBD disk writes are cached and so are the underlying qemu image file writes. This might lead to a deadlock when Windows attempts to flush its cache. I'm not really familiar with the Windows cache manager internals, the guys from OSR (https://community.osr.com) might be able to provide more insight (some Microsoft engineers used to post there). However, it's highly likely that writes are blocked while there's a pending cache flush, despite targeting a different disk.

If this situation is confirmed, we could document something along those lines:

When using WNBD to connect to a local NBD server, make sure to disable
caching either at the WNBD disk level or at the NBD server level. Windows cache
flushes may otherwise deadlock.

For example, when using ``qemu-storage-daemon.exe`` locally, make sure to pass
``cache.direct=on`` like so:

    qemu-storage-daemon.exe  \
      --blockdev driver=file,node-name=file,filename=e:\img\test_100gb.raw,cache.direct=on \
      --blockdev driver=raw,node-name=raw,file=file \
      --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 \
      --export type=nbd,id=export,node-name=raw,name=test_100gb,writable=on

Its also possible that if the nbd-client part of wnbd were jettisoned into userspace, the the OS/kernel would not be insistent that wnbd should respond immediately or within short timebound, allowing for a slower or delayed response from the emulated drive..

I've been testing using the userspace nbd client proposed here: https://github.com/cloudbase/wnbd/pull/122. The behavior is the same.

Q) Another question is to learn from you. Requesting you to document what you did, and what tools you used. As seen in bug qemu-#1605 so far, I am able to compile my own qemu-8.0 with debuginfo symbols using mingw, and do some very basic troubleshooting using gdb like putting assert() /printf() statements and inspecting stack frame variables. How did you troubleshoot this? How did you know qemu was stuck on writefile()

I did mostly the same. I was initially considering using Ubuntu to cross compile Qemu for Windows, however it was easier to go with msys2 since it already had all the dependencies (e.g. glib).

Then I checked the nbd server code, looking for the main entry points for NBD IO operations. I was a bit familiar with the qemu block drivers since I wrote a few patches for vhd(x) about 10 years ago :). I knew that raw image file writes should eventually hit this: https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L86.

I added a few log messages to track IO requests more easily: https://github.com/petrutlucian94/qemu/commit/b18da91174c73cebec8c1a2e8a13726f7c6687c4. Probably I could've just used qemu tracing, however I haven't used that in a while and it was easier to just add a few print messages :). I was running wnbd in debug mode and checking the last 10 requests or so before the deadlock occurred. I was mostly ensuring that the write address, length and nbd tag match the ones received by the nbd server.

Sure, a stack trace would've also told me about the WriteFile hang, however I was also looking for unusual IO patterns. When building qemu using mingw-gcc, the debug symbols use dwarf format, which can't be used by WinDBG. I usually use this gdb fork for opening Windows dumps (the upstream version doesn't support this feature): https://github.com/ssbssa/gdb/releases/tag/gdb-13.1.90.20230220.

Another option might be to compile qemu using mingw-llvm, which has .pdb symbol support, which can be handled by WinDBG.

hgkamath commented 1 year ago

In comment, 1551603847 I had this idea of using a script to detect drive-lock, quickly, visually without manual checking.

PARTIAL-IDEA-FAIL: While xcopy is in progress, run the below script in powershell. It should not block on file-access, and keep printing the time every 6 seconds. Not that I expect the file-size to change.
PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1}" -f $(date),$(dir F:\gkpics01.qcow2).Length ; sleep 6 ; $i = $i + 1 }
The above turned out to be ineffective. It probably got the filesize from RAM/fs-metadata-cache. So I tried the below
PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1} {2}" -f $(date),$(dir F:\gkpics01.qcow2).Length,$(dir F:\gkpics01.qcow2).LastWriteTime[0].ToString() ; "test" | Out-File -FilePath K:\tmpq\Downloads\a.txt ; sleep 6 ; $i = $i + 1 } But with this, I could not reproduce the stuck-state, it was acting like a keep-alive.
while neither disabling filebuffering in xcopy nor in qsd [EDIT] I got the stuck-state to happen at-least once in this way.

It feels like it is less likely for the stuck-state to happen, (shows up much later into the run), when I just stay glued to the same desktop-screen with the powershell windows and do nothing else but wait for xcopy or notepad to become stuck, and hoping to see a stuck script as well. Either stuck-state happens but script does not stop, or stuck-state-never-happens. It feels like it is more likely for the stuck-state to happen (shows up sooner into the run), when I had multi-tasked to a Firefox window on a different desktop-screen (win10 multiple desktop feature), which is what I did most of the time. After a while, as one browses, drive-lock happens. When a new weblink/tab is opened, the Firefox window freezes/becomes unresponsive. Perhaps, using something like Firefox, creates contention for filebuffer cache. One might take longer notice the stuck state, if one only context switches between browser tabs for which Firefox already has the pages artifacts in memory. Its a good thing that even in the stuck-state/drive-lockup, when things like even start-menu and taskbar might freeze, one can switch between desktops using Ctrl-Win-Left and Ctrl-Win-Right Keyboard shortcuts. (Rapid switching between desktop screens does not increase chances of the bug)

So file-activity on the host may have an influence on this bug.

activities like the running the script and monitoring it, tends to keep the file-buffer usage around the files being operated on, i.e. the qcow2 and hence make the stuck-state/bug less likely to happen later
activities like firefox, might be contending and using up filebuffer space and hence make the stuck-state/bug more likely to happen sooner
Using the workaround, either in qsd or in xcopy, preventing double layered file buffering, avoids encountering the stuck-state/bug.

These hunches I get could be superstitions.

That something external to both qsd and wnbd has this effect makes me think. This could be a bug involving file-buffering inside Win10 kernel itself.

writes are blocked while there's a pending cache flush, despite targeting a different disk

A cache flush of files at a top layer storage (J:), become block-writes into a vdisk-file, that are also cached, the vdisk-file being located on a lower layer (F:). The Windows File Cache-Manager manages Caching per file object. ref: learn.ms/file-caching Maybe fs-filebuffer for F: caches a small no blocks of the single large vdisk file (93 GiB), perhaps those blocks from the last few accesses WriteFile() / ReadFile()
The qsd workaround works because when J: files are being flushed, the qemu-writeFile() F: no longer requires interception and decision by the Windows File Cache-Manager. The xcopy /J workaround works because, there is no J: layer file-buffer cache to flush that is intense enough to cause F: writes to block.

Commands like Write-VolumeCache learn.ms/write-volumecache and sync F J sysinternals/sync are used to manually commit file-buffer to disk. An admin usually invokes them before unmounting/detaching a volume/disk.
1/8th of the yet-unwritten pages to the drive were written out per second ans.ms/031a , learn.ms/file-caching
Windows flushes its internal file cache on an eight-second rotating window. That is, every second, one-eighth of the dirty pages in the cache are queued for writing superuser.com
Disk queue length vs secs/rw learn.ms/qa19898
Performance Monitoring and Analysis Pierre Bijaoui, Juergen Hasslauer, in Designing Storage for Exchange 2007 SP1, 2008 Disk Queue Length
- There should never be more than half the number of spindles in the queue length sciencedirect
David B Probert - Windows Kernel Internals Cache Manager , Lectures
- Flushes are synchronous
An application can force the operating system to write the contents of these data buffers to the disk by using the FlushFileBuffers() function. Takes file handle as argument. learn.ms/flushing-buffered-io , learn.ms/FlushFIleBuffers
- qemu does call this function from block/io.c#L945/bdrv_co_pwrite_sync() -> block/io.c#L2901/bdrv_co_flush() -> block/io.c/bdrv_aio_flush() = block/file-win32.c/raw_aio_flush() -> paio_submit() -> thread_pool_submit_aio() -> aio_worker() -> block/file-win32.c#L137/FlushFileBuffers()
Should VM-s/Storage-Backends use the host-file-buffer-cache
- some sources say yes as it is a performance benefit suse
- From my reading and thinking so far,
- letting a VM/Storage-Backend use host-file-buffer-cache
  - con: the VM will consume host resources, and may exhaust the host-file-buffer-memory
  - con: one guest-VM may unfairly use more host-page-cache-resources, and flush-out/ squeeze out another VM-s page cache resources
- specifying cache.direct=on
  - pro: avoids bugs such as these.
  - if VM and letting VM guest OS manage its own file-buffer-cache in its own memory
  - pro: the guest OS will anyway use a small part of its memory for page-cache
  - pro: the guest OS has more information of the filesystem and file blocks it is saving, so it is in a better position to decide on how to flush/prune/populate its file-buffer-cache. Ex is the file marked as a tmp file, likelihood of frequent access, etc
  - pro: host and guest are insulated from one another.
  - If Storage-backend
  - pro: the file caching will anyway be done on the top level filesystem of host (or guest or whoever else client uses it).
- letting VM software/Storage-Backend (qemu/qsd) emulate its own disk-block-caches in its (qemu process's) userspace
  - this depends on the implementation, as of today, qemu/qsd doesn't have this, didn't see a way to configure a cache-size option per disk.
When we talk about file-caching
- There is the caching of files (and blocks of files) that have already been recently read or recently saved, to save on future access time.
- and there is also the caching of files (and blocks of files) that are still yet to be saved, to avoid lowering write-performance by having to do synchronous IO-writes to disks at slower disk-io-speed .
- The former if lost is not an issue, the later has much higher priority.
- When cache is flushed, the latter must be written to disks, while the former can be erased from memory.

Not yet had a testrun where workarounds (cache.direct=off or xcopy /j) did not work, That's good. Though the reproducibility without the workaround is probabilistic, the reason I may have had more certainty of running into it, is because I used to let it this run in the background while doing other things, including using Firefox to browse. In the 85GiB copy, most stucks do happen under copying of 13 GiB (5 min), I get a bit antsy if xcopy passes that mark. But, I've also had a testrun wherein the stuck-state happened near to the end of the xcopy.

In the event success happens, one needs to repeat the test a few times, in order to get a statistical feel so as to differentiate between the following cases.

test_for_bug_real_fail_success (common, mainly-expected)
test_for_bug_spurious_pass (happens rarely)
test_for_workaround_real_pass_success (common, mainly-expected)
test_for_workaround_spurious_pass (one never knows)
test_for_workaround_failure (never happened)

petrutlucian94 commented 1 year ago

I had a simple loop that was just copying files and it used to hang after about 5 minutes. After disabling the cache on the nbd server side, it transferred a few TBs overnight without problems.

The best thing we can do right now is to document this, I'll send a patch. Another option would be to forward LUN resets to the nbd daemon, however that would still be less than ideal: having IO stuck every 5 minutes for 30s (under load), until Storport issues a reset.

By the way, I've merged the PR that moves the NBD daemon to userspace (libwnbd): https://github.com/cloudbase/wnbd/pull/122.

LE: I've documented it here: https://github.com/cloudbase/wnbd/pull/123.

petrutlucian94 commented 1 year ago

After documenting the NBD caching issue and implementing stale connection detection (https://github.com/cloudbase/wnbd/pull/126), I think we can close this issue.

hgkamath commented 1 year ago

Yes, go ahead,

This is the corresponding issue on the qemu side, just for reference.
https://gitlab.com/qemu-project/qemu/-/issues/814
It only contains only occasional summaries as things cleared up.
I may need to update its description as many confounding issues cleared since time of first filing.

In a direct email communication that I had with qemu block-code developers, some points that came up :

Surprise that cache.direct / file-buffering matters as it is thought that the flag does not affect code path.
Insufficient testing of file-win32.c as well as development on the windows side of things
Possibility that createfile/overlapped flags were incorrect (as we ourselves had considered)
Surprise that qsd continues running normally after being blocked on the writefile, when drive-lockup is ended by uninstalling wnbd-driver and aborting the xcopy.
Speculation on whether worker thread exhaustion of a finite-pool of a pre-determined max-number-of-threads was a cause, Thread exhaustion could prevent nbd-export co-routine from waking up and responding to wnbd, thereby initiating the cyclic deadlock. One suggestion I made was to consider if it is possible to reserve a worker thread for nbd.
To test if QMP has also gone non-responsive during deadlock, in the manner of nbd. But to do this one needs a port of python-qmp on windows, and complicating matters is that the machine is in drive-lockup and so one can barely start/run anything.