cloudbase / wnbd

Windows Ceph RBD NBD driver
GNU Lesser General Public License v2.1

wnbd driver and qemu-nbd - DriveIO lockup and corruption #63

Closed hgkamath closed 1 year ago

hgkamath commented 2 years ago

I was doing some testing on WIN10-21H2-19044-1415/WNBD-0.2.2-4-g10c1fbe qemu-6.2.0-rc4/ExFAT/qcow2/NTFS

For complete details, see https://gitlab.com/qemu-project/qemu/-/issues/727#note_780895802

On Linux, only the VHDX format corrupts during expansion, while qcow2 and other virtual-disk formats seem fine. But on Windows, I found that even the qcow2 format can corrupt during expansion (TEST E). For this reason, I think this is a separate, distinct bug from the original, and wnbd may also be a factor. The Windows-only bug is in addition to the qemu-only bug, which is also lurking in there.

The bug is reproducible. A script for synthetic data generation is provided in an earlier comment in that issue filing (link).
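
(Roughly, the data set is a directory tree of random-content .bin files plus a find.CHECKSUM file of sha256 sums. A rough bash equivalent of such a generator, not the exact script from the linked comment, would be:)

mkdir -p sgdata/photos001/D0000_example   # hypothetical directory name, for illustration only
for i in 0 1 2 3 4 ; do
  head -c 100M /dev/urandom > sgdata/photos001/D0000_example/File_000${i}.bin
done
( cd sgdata/photos001 && find . -type f -name '*.bin' -exec sha256sum {} + > find.CHECKSUM )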

It is possible that this is a qemu-nbd-on-Windows-only bug, but then the wnbd driver is also involved here. The semi-official driver is test-signed and seems to be still under development. I guess the bug needs to be squashed by troubleshooting and closing in on it from both sides.

The 5 test-cases can also be part of wnbd testing.

Q) Is there a recommended version of qemu-nbd that is known to work reliably with WNBD, such that the two together have been well tested?

petrutlucian94 commented 2 years ago

Hi,

Thank you for your bug report. From the link that you have provided, I understand that you've tested multiple nbd clients (wnbd as well as the Linux native nbd client), multiple qemu-nbd versions and image formats.

Since you're hitting the same issue with the Linux nbd client, I doubt this is a wnbd bug. I'd suggest narrowing down the problem, even taking nbd out of the picture completely. In your Qemu bug report, you're stating that some image formats are affected while some are not. You can verify this by directly attaching the images to Qemu vms, which would exercise the same Qemu block drivers, without actually using NBD.

My first assumption was that there may be unflushed changes, but since you're unmounting the volumes before closing the nbd connection, that shouldn't be the case. It may be interesting to do a wnbd-client stats $mapping before removing the nbd connection though.

Q) Is there a recommended version of qemu-nbd that is known to work reliably with WNBD, such that the two together have been well tested?

wnbd provides a generic NBD client, the qemu-nbd version shouldn't really matter. In case of a protocol incompatibility, the connection would be dropped and the log would contain some meaningful error.

By the way, WNBD-6.12.39.253 isn't the actual wnbd version, you may use wnbd-client version to retrieve it.

Regards, Lucian

hgkamath commented 2 years ago

Thank you for your reply.

Trying the qemu-VM from the inside is a good suggestion. qemu-system-x86_64 is very flaky under WHPX (Windows Hypervisor Platform / Hyper-V); perhaps I should consider switching it off. The qemu VM is more reliable with the Intel HAXM hypervisor. qemu tools like qemu-img and qemu-nbd are smaller independent binaries that do not engage qemu's virtualization/emulation, just the qemu block framework. I will try to do it some time and see if corruption happens that way.

In the Windows qemu-setup (https://qemu.weilnetz.de/), the qemu-nbd.exe binary started being built and shipped with the installer as of 2021-05-05, qemu version 6.0 (qemu-w64-setup-20210505.exe). So this Windows qemu nbd-server is a relatively recent addition to the set of available nbd peers/servers/clients.

qemu-nbd here is a local nbd-server. Both the SSD and HDD are connected via mini-SATA, so read/write speeds are much higher than what one may get over the network; resmon.exe claims about 70 million bytes per second through the qemu-nbd.exe process. Ordinary network speeds are often lower than native bus speeds: even with nominal 1 Gbps ethernet, consumer network traffic seldom reaches the speed of the system's PCI-Express bus. At this native machine speed, there are bound to be bottlenecks and buffer fills/overfills.

Description of what is meant by the machine being stuck: when the hang happens, it is lucky that some open GUI processes work/respond at all. I keep a few extra privileged cmd.exe/pwsh.exe windows open. Any other process that makes any disk access goes into a presumably IO-blocked "not responding" state. It is not possible to invoke new programs, so one must be very careful with the processes that do respond. As long as no disk access is made, processes such as resmon.exe and taskmgr.exe continue to refresh and display. GUI components like menus and selections work. Text in notepad.exe can be edited, but attempting to save will cause it to go not-responding. CPU usage is low; typing, menu GUI and mouse movements are normal, and the on-screen applications that still work are responsive. The only programs that keep working are the cmd/pwsh windows, and even there, attempting to change directory with cd will make the command shell get stuck. It does not matter which hard disk is being accessed: SSD1, HDD1, HDD2, etc.

I think the reason why wnbd-client.exe can still be run from the cmd.exe prompt is that its dll/exe images are memory-cached, so the shell is not disk-IO blocked. However, other than stats and uninstall-driver, the other commands take a very, very long time to get a response from the wnbd driver.

qemu, qcow2 and kernel-nbd are well tested and rock solid under Linux. This is the other main reason why this issue is most likely a separate bug (bug2-63): its effect is seen in a different pathway and affects any type of virtual-disk format, including qcow2. In the original bug (bug1-727), only the VHDX virtual-disk format is affected; everything else is fine.

Uninstalling and reinstalling the driver is the only way to unstick the situation. At that point, all not-responding processes/apps come back alive. Until I discovered this, I was only force-shutting-down the laptop. The following don't work: Ctrl-C or taskmgr end-task on qemu-nbd or xcopy, or attempting wnbd-client unmap. taskmgr will accept the end-task the first time on the IO-blocked not-responding process, but is unable to truly terminate it. If taskmgr end-task is attempted again on a previously end-tasked process, an "action not allowed" popup dialog is shown.

E:\vstorage\nbd\wnbd_client>wnbd-client.exe uninstall-driver
libwnbd.dll!RemoveAllWnbdDisks INFO Hard removing WNBD disk: gkpics01
libwnbd.dll!RemoveWnbdAdapterDevice INFO Removing WNBD adapter device. Hardware id: root\wnbd. Class GUID: {4D36E97B-E325-11CE-BFC1-08002BE10318}
libwnbd.dll!CleanDrivers INFO Removing WNBD driver: oem30.inf

E:\vstorage\nbd\wnbd_client>wnbd-client.exe install-driver E:\vstorage\nbd\wnbd_driver\wnbd.inf
libwnbd.dll!WnbdInstallDriver INFO Cleaned up stale WNBD adapters.
libwnbd.dll!InstallDriver INFO Installing WNBD driver: E:\vstorage\nbd\wnbd_driver\wnbd.inf
libwnbd.dll!CreateWnbdAdapter INFO Creating WNBD adapter. Class: SCSIAdapter. Hardware id: root\wnbd.

This also hints to me that it is the wnbd driver that is freezing the system's disk accesses. The desktop is logged in as an unprivileged, non-administrative user. qemu-nbd.exe is an unprivileged process started/run in an unprivileged shell. Under normal use, while running, it is easily stopped with Ctrl-C or taskmgr. It shouldn't be able to do anything to bring the whole system down.

Summary of why bug2-63 is separate from bug1-727 and why it might be WNBD and not qemu-nbd.exe

  1. Any virtual-disk format is affected (even qcow2, not just VHDX)
  2. The whole local machine, including unrelated processes, gets stuck with disk IO frozen (an unprivileged process should not be able to do this)
  3. Uninstalling and reinstalling the driver using wnbd-client unsticks the system
  4. qemu-nbd is a local nbd-server. Other previous use-cases of wnbd may have been true remote network-based NBD servers, which may not push the wnbd driver and the local Windows kernel to block-transfer speed limits.

Miscellaneous troubleshooting logs

PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe list-opt
LogLevel           : 1 (Default: 1)
NewMappingsAllowed : true (Default: true)
EtwLoggingEnabled  : true (Default: true)
WppLoggingEnabled  : false (Default: false)
DbgPrintEnabled    : true (Default: true)
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe set-opt LogLevel 4
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe list-opt
LogLevel           : 4 (Default: 1)
NewMappingsAllowed : true (Default: true)
EtwLoggingEnabled  : true (Default: true)
WppLoggingEnabled  : false (Default: false)
DbgPrintEnabled    : true (Default: true)

PS E:\vstorage\nbd\wnbd_client> logman delete "WNBD_tracing_session"
The command completed successfully.
PS E:\vstorage\nbd\wnbd_client> logman create trace "WNBD_tracing_session" -p "{E35EAF83-0F07-418A-907C-141CD200F252}" "0xffffffff" "0xff" -o e:\tmpq\TraceFile_2.etl -rt
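
(The session can then presumably be started and stopped with logman, and a captured .etl converted for reading with tracerpt, something like:)

PS E:\vstorage\nbd\wnbd_client> logman start "WNBD_tracing_session"
PS E:\vstorage\nbd\wnbd_client> logman stop "WNBD_tracing_session"
PS E:\vstorage\nbd\wnbd_client> tracerpt e:\tmpq\TraceFile_2.etl -o e:\tmpq\TraceFile_2.csv -of CSV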

Still trying to capture some logs.

Miscellaneous troubleshooting

qemu-w64-setup-20210505.exe (6.0.0)
qemu-w64-setup-20210619.exe (6.0.0) <- i used this
qemu-w64-setup-20211215.exe (6.2.0) <- earlier I had tried this

Below are the files one needs; one can just extract these exe/dlls from the qemu-setup and ceph-pacific-setup without doing a full install.

PS E:\vstorage\nbd\qemu-nbd-6.0> (Get-ChildItem).Name
iconv.dll
libbz2-1.dll
libcurl-4.dll
libffi-6.dll
libglib-2.0-0.dll
libgmp-10.dll
libgnutls-30.dll
libhogweed-4.dll
libidn2-0.dll
libintl-8.dll
libnettle-6.dll
libnghttp2-14.dll
libp11-kit-0.dll
libpcre-1.dll
libssp-0.dll
libtasn1-6.dll
libunistring-2.dll
libwinpthread-1.dll
libzstd-1.dll
qemu-nbd.exe
zlib1.dll

PS E:\vstorage\nbd\wnbd_client> (Get-ChildItem).Name
libwnbd.dll
wnbd-client.exe

PS E:\vstorage\nbd\wnbd_driver> (Get-ChildItem).Name
wnbd.cat
wnbd.inf
wnbd.sys
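
(As a side note, assuming the NSIS-based qemu installer opens as an archive, the needed files can presumably be pulled out with 7-Zip instead of running the installer, e.g.:)

7z x qemu-w64-setup-20210619.exe -oE:\vstorage\nbd\qemu-nbd-6.0 qemu-nbd.exe *.dll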

Below are some wnbd-client stats captures. Usually the numbers OutstandingIO, PendingSubmittedIO and TotalReceivedIO add up, but they start diverging after the hang happens. Thereafter, collecting stats responses with wnbd-client is no longer quick, but takes quite some time.

PS E:\vstorage\nbd\wnbd_client> ./wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 4645
TotalSubmittedIORequests       : 4645
TotalReceivedIOReplies         : 4645
UnsubmittedIORequests          : 0
PendingSubmittedIORequests     : 0
AbortedSubmittedIORequests     : 0
AbortedUnsubmittedIORequests   : 0
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 0

When the hang happens:

The file below (containing random bytes) is where the xcopy of the synthetic generated data set was when the system became stuck.

PS C:\Users\gana> pwsh -Command { $sdate=$(date) ; echo "$sdate" ; xcopy /E E:\test\sgdata K:\ ; echo "$sdate" ; date } 
12/18/2021 15:41:18
E:\test\sgdata\photos001\find.CHECKSUM
:
E:\test\sgdata\photos001\D0000_88114831\File_0000.binE:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0000.bin
E:\vstorage\nbd\wnbd_client>dir H:
 Volume in drive H is CPERF0
 Volume Serial Number is F196-DB9E

 Directory of H:\

12/18/2021  03:44 PM     6,194,987,008 gkpics01.vhdx
               1 File(s)  6,194,987,008 bytes
               0 Dir(s)  149,989,228,544 bytes free
E:\vstorage\nbd\wnbd_client>wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 8706
TotalSubmittedIORequests       : 8695
TotalReceivedIOReplies         : 8677
UnsubmittedIORequests          : 11
PendingSubmittedIORequests     : 18
AbortedSubmittedIORequests     : 19
AbortedUnsubmittedIORequests   : 9
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 1

E:\vstorage\nbd\wnbd_client>wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 8841
TotalSubmittedIORequests       : 8695
TotalReceivedIOReplies         : 8677
UnsubmittedIORequests          : 146
PendingSubmittedIORequests     : 18
AbortedSubmittedIORequests     : 19
AbortedUnsubmittedIORequests   : 144
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 1
hgkamath commented 2 years ago

In the below test I use a qemu VM to do copy from inside VM.

Preparation

After booting Alpine Linux 3.15

VHDX TEST

In cmd prompt: 
C:\Windows> copy E:\vstorage\gkpics01_new.vhdx H:\gkpics01.vhdx

C:\Users\gana>dir E:\vstorage\gkpics01_new.vhdx
 Volume in drive E is 15KJ_Images
 Volume Serial Number is 1EA6-2771

 Directory of E:\vstorage

11/12/2021  01:21 PM       205,520,896 gkpics01_new.vhdx
               1 File(s)    205,520,896 bytes
               0 Dir(s)  360,436,940,800 bytes free

PS E:\vstorage\nbd\qemu-nbd-6.0> & C:\vol\scoop_01\scoopg\shims\qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot d -cdrom E:\transcend\alpine-standard-3.15.0-x86_64.iso -hda H:\gkpics01.vhdx -hdb E:\test\sgdata.vhdx  -display gtk -vga virtio -rtc base=utc -net user -device "virtio-net,netdev=vmnic" -netdev "user,id=vmnic,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15"
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: GdkPixbuf: Cannot open pixbuf loader module file 'C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache': No such file or directory

This likely means that your installation is broken.
Try running the command
  gdk-pixbuf-query-loaders > C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache
to make things work again for the time being.
Windows Hypervisor Platform accelerator is operational
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: hub 0 with no nics

Inside the VM, prepare the machine as mentioned earlier: setup-alpine, mount partitions

localhost:/mnt# (sdate=`date` ; cd /mnt/b ; rsync -avH ./photos001 /mnt/a ; echo "$sdate" ; date ) 
localhost:/mnt# ls -l /mnt/a/photos001

The VHDX fails badly; even an ls -l did not succeed. A reboot was not required. Nothing surprising here.

QCOW2 TEST

E:\vstorage\nbd\wnbd_client>C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 99723771904
Formatting 'H:\gkpics01.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=99723771904 lazy_refcounts=off refcount_bits=16

E:\vstorage\nbd\wnbd_client>dir H:\gkpics01.qcow2
 Volume in drive H is CPERF0
 Volume Serial Number is F196-DB9E

 Directory of H:\

12/19/2021  07:52 PM           198,144 gkpics01.qcow2
               1 File(s)        198,144 bytes
               0 Dir(s)  156,183,953,408 bytes free

PS E:\vstorage\nbd\qemu-nbd-6.0> & C:\vol\scoop_01\scoopg\shims\qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot d -cdrom E:\transcend\alpine-standard-3.15.0-x86_64.iso -hda H:\gkpics01.qcow2 -hdb E:\test\sgdata.vhdx  -display gtk -vga virtio -rtc base=utc -net user -device "virtio-net,netdev=vmnic" -netdev "user,id=vmnic,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15"
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: GdkPixbuf: Cannot open pixbuf loader module file 'C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache': No such file or directory

This likely means that your installation is broken.
Try running the command
  gdk-pixbuf-query-loaders > C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache
to make things work again for the time being.
Windows Hypervisor Platform accelerator is operational
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: hub 0 with no nics

Inside the VM, prepare the machine as mentioned earlier: setup-alpine, mount partitions, rsync copy

localhost:/mnt# (sdate=`date` ; cd /mnt/b ; rsync -avH ./photos001 /mnt/a ; echo "$sdate" ; date ) 
:
Sent 90,963,665,782 bytes  received 298,855 bytes  60,622,431.61 bytes/sec
total size is 90,940,594,995  speedup is 1.00
Sun Dec 19 20:19:26 IST 2021
Sun Dec 19 20:44:26 IST 2021

Verifying source drive

localhost:/mnt# (sdate=`date` ; cd /mnt/b/photos001 ; sha256sum -c ./find.CHECKSUM | tee /tmp/fb ; echo "$sdate" ; date ) 
:
Sun Dec 19 21:28:09 IST 2021
Sun Dec 19 21:48:35 IST 2021
localhost:/mnt# cat /tmp/fb | grep -v ": OK" | wc -l  
0

Verifying the destination drive containing the rsynced/copied sgdata

localhost:/mnt# (sdate=`date` ; cd /mnt/a/photos001 ;  sha256sum -c ./find.CHECKSUM | tee /tmp/fa ; echo "$sdate" ; date ) 
:
/D0039_1923a8ec/File_0002.bin: OK
Sun Dec 19 20:47:53 IST 2021
Sun Dec 19 21:07:39 IST 2021
localhost:/mnt# cat /tmp/fa | grep -v ": OK" | wc -l  
202

!! QCOW2 is corrupted

It is true that there are bugs in both the VHDX and qcow2 code pathways of qemu virtual-disks, which show up even without the nbd code pathway being involved. However, it is also true that no system-wide lockup happened.
The system lockup is the doing of the wnbd driver. That is why I think wnbd itself also has a bug.

The nbd protocol and the wnbd kernel driver need to be robust/resilient towards failures in the nbd peer. So if the nbd peer is misbehaving, at worst wnbd should disconnect the drive's backend connection, refuse service, or try re-transmission; and if it is re-transmission, it shouldn't get stuck in an infinite loop, locking up the whole system.

The system drive-IO lockup is reminiscent of situations like hard-drive bad-sector errors or cdrom read errors, wherein the Windows system gets stuck retrying without ever succeeding, or takes a long time to move on. I don't think my SSD really has problem blocks; data integrity is fine when large virtual disks are copied onto and off of it. So, because qemu-nbd is misbehaving, wnbd is locking up.

Summary posted to bug1-727 comment https://gitlab.com/qemu-project/qemu/-/issues/727#note_789170496

petrutlucian94 commented 2 years ago

So in your case, some IO operations are hanging on the nbd server side. Windows (storport.sys to be precise) will cancel pending IO requests after about 30s, in which case that AbortedSubmittedIORequests counter will be increased.
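
For instance, you could poll those counters while reproducing the hang (just a sketch, using the gkpics01 mapping from your logs):

PS E:\vstorage\nbd\wnbd_client> while ($true) { .\wnbd-client.exe stats --instance-name gkpics01 | Select-String "Aborted|Pending|Outstanding" ; Start-Sleep -Seconds 5 }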

It's true that the Windows explorer usually hangs if the drives become unresponsive, but that's not WNBD's fault. It's the same with directly attached disk drives or SMB shares. WNBD simply receives IO requests from Storport and forwards them to an NBD server (or to a registered userspace app that uses libwnbd, but that's a different topic).

Storport miniport drivers, such as WNBD, are not supposed to retry canceled operations. At best, we may try to detect IO deadlocks and disconnect the disk, but that should be configurable.

A few other tests that you could make:

petrutlucian94 commented 2 years ago

I'm going to close this issue since it's not a WNBD driver bug, but rather a qemu-nbd or environment issue.

hgkamath commented 2 years ago

Sorry for not updating; this is on my mind, but I haven't yet gotten to do it. I'm still yet to test qemu-7, which was just recently released. The qemu project also released a new binary, qemu-storage-daemon.exe, which can also act as an nbd server, amongst other things.

Just some updates. The kernel 5.15.x series, up to 5.15.24, had ntfs3 read-side bugs, and up to the 5.17.x series there were folio/mmotm feature commits, so I had taken a pause until some other qemu/kernel bugs got sorted out, so that there is less ambiguity and compounding. I don't think the wnbd driver is in the clear yet. But it is true that some of the negative test results above, done after taking your very good suggestion of testing wholly inside a qemu VM, were on account of these bugs.

I'll reopen this when I get to test it again and it seems more certain that it isn't due to a qemu/linux-guest-kernel/linux-host-kernel issue.

Also, on a side note, I wish cloudbase released a regularly-signed wnbd driver, which you should if you think it is bug-free. On my laptop, in order to use the test-signed driver, I need to use the one-time "disable driver signature enforcement" special boot, or use test-signing mode, for which the HP UEFI requires me to disable secure boot. Test mode also puts an ugly wallpaper watermark. This hassle is what creates reluctance to test. Even if things work well, I am not sure I would decide to mount drives via wnbd this way on a regular basis if it requires me to permanently turn off secure boot.

petrutlucian94 commented 2 years ago

Thanks for feedback. We're currently shipping a signed version of the driver with the Ceph msi installer: https://cloudbase.it/ceph-for-windows/, I hope that helps. In the future, we may consider a standalone WNBD installer.

hgkamath commented 2 years ago

I think the ceph-msi-installer ceph_pacific_beta.msi, which is still beta, is also just test-signed (am I wrong?). I had used the same before as well.

  1. Open device manager, right click any wnbd SCSI virtual adapter and uninstall device
  2. Open admin cmd prompt
  3. (only if updating driver) Uninstall ceph from programs and features
  4. Ensure that there is no prior certificate from Cloudbase Solutions Srl under Trusted Publishers: a. open certmgr from cmd; b. delete the Cloudbase certificate under Trusted Publishers/Certificates
  5. (only if updating driver) Reinstall ceph
  6. issue command : .\wnbd-client.exe install-driver --filename 'C:\Program Files\Ceph\driver\wnbd.inf'
  7. A dialog prompt comes up: "Would you like to install this device software?" with the name and publisher. a. Ensure the checkbox "Always trust software from Cloudbase Solutions Srl" is checked. b. Click Install.
  8. Driver is installed, and wnbd driver shows up under storage controllers in device manager
  9. Close certmgr.exe, wait a minute and reopen certmgr.exe. a. Verify that the Cloudbase certificate is again present under Trusted Publishers/Certificates.
    b. Double-click/open the certificate and verify: 1. in the General tab: the digital signature is OK, certificate validity is from 8/13/2021 to 8/12/2024; 2. in the Details tab: Key Usage has a yellow triangle for Key Usage: Digital Signature (80)
  10. However, the driver in Device Manager has a yellow triangle, and right-click > Properties gives code 52: "Windows cannot verify the digital signature for the drivers required for this device. A recent hardware or software change might have installed a file that is signed incorrectly or damaged, or that might be malicious software from an unknown source. (Code 52)"
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe -v
wnbd-client.exe: 0.2.2-4-g10c1fbe
libwnbd.dll: 0.2.2-4-g10c1fbe
libwnbd.dll!WnbdOpenAdapterEx ERROR No WNBD adapter found. Please make sure that the driver is installed.

driverver from wnbd.inf: DriverVer = 04/28/2022,6.15.27.572

petrutlucian94 commented 2 years ago

Right, it's signed using a Cloudbase certificate and not by Microsoft, which is why secure boot will have to remain disabled. However, you may disable the "testsigning" setting (allowing self signed drivers) when using the Ceph msi.

We decided to keep the "beta" label until we implement CI testing and get more user feedback. The good news is that a Windows Ceph CI job has been added recently [1] and that we already have quite a few users who provide valuable feedback (as you did).

A stable driver release will also go through the Microsoft driver attestation and signing process. Unfortunately, I can't provide an ETA but I hope that this will be ready by the next Ceph release.

[1] https://jenkins.ceph.com/job/ceph-windows-pull-requests/6850/

Jazz9 commented 2 years ago

I have been chasing the same issue for a few months now: Ceph 16.2.7 and Windows 2019. I see this across 4 separate clusters. The larger the cluster, the worse it seems to be (ranging from 50-340 OSDs per cluster).

We had a firmware bug where OSDs restarted randomly, and this would trigger the Windows RBD volume to lock up and never return. A restart of the Windows server would sometimes cause ReFS to corrupt and go RAW. We have implemented snapshots to try to mitigate this.

A planned reboot of a Ceph node doesn't seem to cause the issue, only an OSD assert.

The fault is characterized by PendingSubmittedIORequests going up, but no aborts.

We are open to paying for support to correct this issue.

petrutlucian94 commented 2 years ago

@Jazz9 thanks for the feedback. The issue may be slightly different, as the OP wasn't using RBD but actually NBD (wnbd can also act as a standalone nbd client), and it affected other NBD clients as well.

There might be some problem around aborted or timed out requests, so I think it's worth reopening this issue. The bug may also reside on the Ceph side (rbd-wnbd leverages wnbd), which may not be able to gracefully handle lost OSD connections (at least on Windows). Just checking: have you encountered similar issues with Linux RBD clients?

As a side note, there is a recent Ceph fix that replaces select() with poll() on Windows in order to avoid fd limitations, which can be a problem when having a large amount of OSDs: https://github.com/ceph/ceph/pull/46525.

We'll definitely need to investigate this issue. Unfortunately, I'm no longer actively working on this project (at least for now), so I can't provide an ETA. One of my colleagues may be able to reproduce it.

If needed, you may get in touch with our support team by filling this form: https://cloudbase.it/about/#contact

petrutlucian94 commented 2 years ago

We found the issue. If all the IO dispatch workers are waiting for RBD callbacks indefinitely, the userspace (libwnbd) may never get notified of mapping removals. We'll need to improve the way in which the driver informs the IO dispatcher that the disk has been removed (e.g. by using a separate control thread or some disconnect events).

We'll also have to avoid waiting for IO operations indefinitely on the rbd side (e.g. maybe double check our timeouts).

Jazz9 commented 2 years ago

Great! Normally windows is really good at waiting for IO.

I would expect that the rbd side would need to time out and retry the IO operation (20 sec timeout?). That would help if the OSD had restarted or the PG had peered to another OSD.

Do you have a rough ETA on the fix?

petrutlucian94 commented 2 years ago

There might be more to it, since all rbd-wnbd IO callbacks are actually async and should return immediately (sometimes they don't, and we see the hang). We even saw librbd::Image::close hang; the async version is likely going to hang as well.

It's a bit tricky to reproduce since on our env it only happens once per ~1000 rbd mappings. My colleague's been trying out pacific, quincy and the master ceph branch to see if it's a regression and collect some logs/process dumps.

While there are a few possible mitigations, we'll need to look into the hanging async calls which is probably the root cause of the issue. Unfortunately I can't provide an ETA, I'll try to assist the colleagues that are currently maintaining the project.

petrutlucian94 commented 2 years ago

A small update: looks like we're affected by this winpthreads bug that causes shared mutexes to deadlock: https://github.com/msys2/MINGW-packages/issues/3319. Trace: https://pastebin.com/raw/i3jpTyS3

Ideally, we should get winpthreads fixed. The alternative would be to use a different shared mutex implementation. We did consider building Ceph using Clang or MSVC but that would require a large amount of code changes on the Ceph side.

lhmouse commented 2 years ago

Hello, mingw-w64 developer speaking. I have made some efforts to replace winpthreads due to its lack of maintainability. Do you have any interest in https://gcc-mcf.lhmouse.com/? The mcfgthread library doesn't implement RW locks, but libstdc++ has an alternative implementation using condition variables, so shared_mutex also works there.

petrutlucian94 commented 2 years ago

hi @lhmouse, mcfgthread seems interesting, thanks a lot for the suggestion.

We're currently cross-compiling Ceph using Linux, so we'll have to build mcfgthread and also gcc with the custom patches on top. Would you recommend using a specific gcc version? I see that there are some wip patches for gcc 12: https://github.com/lhmouse/MINGW-packages/tree/master/mingw-w64-gcc

Thank you for your work on mingw, btw :).

lhmouse commented 2 years ago

hi @lhmouse, mcfgthread seems interesting, thanks a lot for the suggestion.

We're currently cross-compiling Ceph using Linux, so we'll have to build mcfgthread and also gcc with the custom patches on top. Would you recommend using a specific gcc version? I see that there are some wip patches for gcc 12: https://github.com/lhmouse/MINGW-packages/tree/master/mingw-w64-gcc

Only the 9000- and 9001- patches are necessary; the latter is required for building libgfortran, otherwise it can be dropped.

The mcfgthread library provides its own __cxa_atexit() and __cxa_finalize() implementations, which require modification of the CRT. A complete set of patches (for both GCC and mingw-w64) is available at https://github.com/lhmouse/mcfgthread/tree/master/patches; please refer to the README.md file for details.

The patch for GCC should apply cleanly to the GCC 11 and 12 branches. As far as I remember, there haven't been any changes since GCC 9, so it probably applies to the 9 and 10 branches as well.

I built my personal toolchains using the standard cross-compilation procedure, which was necessary to replace the default CRT in MSYS2. It should also work when cross-compiling from Linux, I think.

Thank you for your work on mingw, btw :).

My pleasure.

Jazz9 commented 2 years ago

It looks like you are close to a resolution on this issue. Do you have an ETA on the release?

petrutlucian94 commented 2 years ago

It looks like you are close to a resolution on this issue. Do you have an ETA on the release?

the fix got approved, it's in the Ceph QA queue right now. I expect it to be merged in a couple of weeks. Afterwards, I'll backport it to Quincy and Pacific.

Jazz9 commented 2 years ago

@petrutlucian94 - Did you have an update on the backport / availability of this fix?

petrutlucian94 commented 2 years ago

@petrutlucian94 - Did you have an update on the backport / availability of this fix?

hi, the fix has been backported to pacific and quincy. there are a few more patches that we'd like to include before releasing a new MSI.

hgkamath commented 2 years ago

The RC release of quincy 17.2.4 is due shortly (20220919): https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/DUWEJXQH4XRU5UBQOH7IZXVJVCAMOMUJ/ I hope the task of making a signed Windows driver for the final release doesn't slip through the cracks.

petrutlucian94 commented 2 years ago

The RC release of quincy 17.2.4 is due shortly (20220919): https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/DUWEJXQH4XRU5UBQOH7IZXVJVCAMOMUJ/ I hope the task of making a signed Windows driver for the final release doesn't slip through the cracks.

actually, we've just prepared Pacific and Quincy releases that include some of the fixes that haven't merged upstream yet. we've just finished testing, so expect a new msi in the following days.

for what it's worth, the wnbd driver remains mostly the same; all the fixes that we had were actually targeting Ceph, not the driver.

petrutlucian94 commented 2 years ago

The Ceph MSIs have been updated: https://cloudbase.it/ceph-for-windows/

I think we can close this issue now. Just to clarify, we ended up using the same GitHub issue for two (or more) different problems:

hgkamath commented 2 years ago

I'm gonna test the latest beta ceph-quincy-17.2.4 with qemu-7.1 out of curiosity. Disabling secure boot may be a deterrent for me to a long-term transition to qemu<->nbd<->wnbd-nbd-driver<->diskmgmt.

Please name some of the device vendors (companies) and their corresponding devices (model numbers) that have an nbd driver. I suspect such a strategy by them could work if the base case covers 99% of all user needs. Those users who want to live on the bleeding edge with non-mainstream features will tend to be the type who know what they are doing, accept risk, and disable secure boot. In this case, we have yet to determine whether this wnbd nbd driver covers the base case of a reliable nbd driver.

Btw, I think qemu can also provide an rbd device. Would I have better luck with qemu<->rbd<->wnbd-rbd-driver<->diskmgmt?

petrutlucian94 commented 2 years ago

Please name some of the device vendors (companies) and their corresponding devices (model numbers) that have a wnbd driver.

I can't remember if it was a Chelsio or a Cisco network card driver - the Microsoft-signed driver was about 2 years behind the vendor-signed driver, most probably for the same reason. It definitely wasn't a wnbd driver, just their network card driver. My point was that other vendors are affected by this as well and prefer not to go through the Microsoft signing process.

petrutlucian94 commented 2 years ago

Btw, I think qemu can also provide an rbd device. Would I have better luck with qemu<->rbd<->wnbd-rbd-driver<->diskmgmt?

at that point you can just take qemu-nbd out of the picture and use rbd. it depends on your use case.

hgkamath commented 2 years ago

20220926: Changes that may be in effect between when the tests were last done and now

Haswell i7-4700QM 
WinOS, qemu-VM, Test-image all on SSD, sgdata (synthetic generated data) on HDD

Windows-Home
winver: 21H2 (OS Build: 19044.2075) 

Fedora 36 Workstation
Linux fedora 5.18.17-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Aug 11 14:36:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Ceph-Quincy-17.2.4 20220920
E:\vstorage\nbd>E:\vstorage\nbd\wnbd_client\wnbd-client.exe -v
wnbd-client.exe: 0.2.2-11-g3dbec5e
libwnbd.dll: 0.2.2-11-g3dbec5e
wnbd.sys: 0.2.2-11-g3dbec5e

E:\vstorage\nbd>qemu-nbd --version
qemu-nbd 7.1.0 (v7.1.0-11925-g4ec481870e-dirty)
Written by Anthony Liguori.

E:\vstorage\nbd>C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-storage-daemon.exe --version
qemu-storage-daemon version 7.1.0 (v7.1.0-11925-g4ec481870e-dirty)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers

TESTS

Test-01 :

Test-02 :

Test-03 :

Test-04 :

Test-05 :

Test-06 :

Test-07 :

Test-08 :

Test-09 :

Test-10 :

Test-11 :

Test-12 :

Test-13 :

Test-14 :

Test-15 :

Test-16 :

Test-17 :

Notes

  1. qcow2-fixed had to be created under a Linux-Host
  2. imdisk uses discutils which does not support qcow2, only rawfsimg, dmg, vdi, vhd, vhdx, vmdk, Xva

Commands involved

Creating a disk image in Linux

raw

dd if=/dev/zero of=/mnt/a16/gkpics01.raw.img bs=1048576 count=95104 status=progress

qcow2-fixed

qemu-img create -f qcow2 /mnt/a16/gkpics01.qcow2 -o preallocation=full 99723771904

qcow2-dyn

qemu-img create -f qcow2 /mnt/a16/gkpics01.qcow2 99723771904

vhdx-fixed

qemu-img create -f vhdx -o subformat=fixed,block_state_zero=off /mnt/a16/gkpics01.vhdx 99723771904

vhdx-dyn

qemu-img create -f vhdx -o subformat=dynamic /mnt/a16/gkpics01.vhdx 99723771904

Creating a disk image in Windows

raw

"C:\vol\prog_01\prog\windd_chrysocome\ddrelease64.exe" bs=1048576 count=95104 if=/dev/zero of=H:/gkpics01.raw.img --progress

qcow2-fixed ( Not Do-able as of now https://gitlab.com/qemu-project/qemu/-/issues/813 )

qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 -o preallocation=full 99723771904

qcow2-dyn

qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 99723771904

vhdx-fixed

New-VHD -Path I:\gkpics01.vhdx -SizeBytes 99723771904 -Fixed
or
qemu-img create -f vhdx -o subformat=fixed,block_state_zero=off F:/gkpics01.vhdx 99723771904
fsutil sparse setFlag F:\gkpics01.vhdx 0

vhdx-dyn

New-VHD -Path I:\gkpics01.vhdx -SizeBytes 99723771904 -Dynamic
or
qemu-img create -f vhdx -o subformat=dynamic F:/gkpics01.vhdx 99723771904
fsutil sparse setFlag F:\gkpics01.vhdx 0

Blanking a block-device in Linux / Linux-VM

[root@fedora mnt]# ( sdate=`date` ; echo "$sdate" ; dd if=/dev/zero of=/dev/vdc bs=$((512*204)) status=progress ; echo "$sdate" ; date )
Mon Sep 26 09:08:25 AM IST 2022
99677337600 bytes (100 GB, 93 GiB) copied, 1919 s, 51.9 MB/s
dd: error writing '/dev/vdc': No space left on device
954770+0 records in
954769+0 records out
99723771904 bytes (100 GB, 93 GiB) copied, 1920.97 s, 51.9 MB/s
Mon Sep 26 09:08:25 AM IST 2022
Mon Sep 26 09:40:26 AM IST 2022

Make ntfs fs under Linux-VM

[root@fedora gana]# mkfs -t ntfs -Q -L fs_gkpics01 /dev/vdc1

Run an nbd-server in windows

qemu-nbd as nbd-server

E:\vstorage\nbd>qemu-nbd.exe H:\gkpics01.raw.img -x gkpics01
WARNING: Image format was not specified for 'H:\gkpics01.raw.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.

qemu-storage-daemon as nbd-server

direct raw
E:\vstorage\nbd> qemu-storage-daemon --blockdev driver=file,node-name=disk,filename=H:\gkpics01.raw.img --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=disk,name=gkpics01,writable=on

With a vdisk-format
qemu-storage-daemon --blockdev driver=file,node-name=file,filename=H:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on

Starting a qemu-VM in Windows

qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot c -drive "index=0,if=virtio,media=disk,format=raw,file=H:\m01_bootnoefi.raw.img" -drive "index=1,if=virtio,media=disk,format=raw,file=H:\m01_lnx.raw.img" -drive "index=2,if=virtio,media=disk,format=qcow2,file=H:\gkpics01.qcow2"  -drive "index=3,if=virtio,media=disk,format=vhdx,file=E:\test\sgdata.vhdx" -display gtk -vga virtio -rtc base=utc -netdev user,id=vmnic1,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15,hostfwd=tcp::9551-:22 -device virtio-net,netdev=vmnic1 -chardev qemu-vdagent,id=ch1,name=vdagent,clipboard=on -device virtio-serial -device virtserialport,chardev=ch1,id=ch1,name=com.redhat.spice.0 -qmp "tcp:127.0.0.1:5955,server,nowait"

Attach nbd drive in windows using wnbd

PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01

Mount/unmount in Linux-VM

[root@fedora gana]# mount -t ntfs3 /dev/vdd2 /mnt/b
[root@fedora gana]# mount -t ntfs3 /dev/vdc1 /mnt/a
[root@fedora gana]# umount /mnt/a
[root@fedora gana]# umount /mnt/b

The ntfs3 fs-type argument is important, as Fedora 36 Linux by default uses the fuseblk driver

[root@fedora gana]# cat /etc/mtab | grep vdc1
Without the -t ntfs3 fstype argument:
/dev/vdc1 /mnt/a fuseblk rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0
vs. with the -t ntfs3 fstype argument:
/dev/vdc1 /mnt/a ntfs3 rw,relatime,uid=0,gid=0,iocharset=utf8 0 0

Add/Remove drive letter in Windows before unmapping

Get-Volume -FriendlyName fs_gkpics01 | Get-Partition | Add-PartitionAccessPath -accesspath "K:"
Get-Volume -Drive "K" | Get-Partition | Remove-PartitionAccessPath -accesspath "K:"
Set-Disk -Number $diskNumber -IsOffline $true

Do copy/write in Linux-VM

[root@fedora gana]# ( fl="photos001" ; src="/mnt/b" ; dst="/mnt/a" ; sdate=`date` ; cd "$src" ; rsync -avH "$fl" "$dst"  ; echo "$sdate" ; date ; sudo -u gana DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus DISPLAY=:0.0 -- notify-send "$src/$fl" "rsync $src/$fl" )

Do copy/write in windows

Use the Windows Explorer GUI to copy-paste data from the src folder to the dst folder, or use xcopy:

ps> pwsh -Command { $sdate=$(date) ; echo "$sdate" ; xcopy /E E:\test\sgdata K:\ ; echo "$sdate" ; echo "$(date)" }
: 
15377 File(s) copied
09/25/2022 15:24:13
09/25/2022 15:45:21 

Sync command in windows, from sysinternal command line utilities

PS C:\> sync K:
Sync v2.2 - Flush cached data to disk.
Copyright (C) 2016 Mark Russinovich
Sysinternals - www.sysinternals.com
Flushing: K

Sync command on Linux-VM

[root@fedora gana]# sync

Unmap/disconnect/detach drive in windows

PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe unmap gkpics01
libwnbd.dll!WnbdRemoveEx WARNING Soft device removal failed. Falling back to hard removal.

(This happens even if I do drive-letter removal in diskmgmt.)

Unmapping the drive takes about 30 seconds before anything happens. Note the error message.

Verify copied data in Linux-VM

[root@fedora gana]# # verification under linux-VM
[root@fedora gana]# mount /dev/vdc1 /mnt/a
[root@fedora gana]# ( fl="photos001" ; rtpt="/mnt/a" ; cd "${rtpt}/${fl}" ; sdate=`date` ; echo "$sdate" ; sha256sum -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; date ; sudo -u gana DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus DISPLAY=:0.0 -- notify-send "$src/$fl" "checksum $src/$fl" )
Sun Sep 25 11:51:34 AM IST 2022
Sun Sep 25 11:51:34 AM IST 2022
Sun Sep 25 12:13:53 PM IST 2022

Verify copied data in windows

PS C:\Users\gana> powershell -Command { $fl="photos001" ; $rtpt="K:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)"  }
C:\Users\gana\OneDrive\Documents\WindowsPowerShell
09/25/2022 15:55:43
09/25/2022 15:55:43
09/25/2022 16:07:55

The error message during unmap/disconnect/detach of the drive in Windows happens even if I manually remove the drive letter in the diskmgmt GUI before unmapping.

Q1) How can one unmount the drive without worrying about causing an error?

Tests [01,02,03,04] were done using a fixed-size raw image. I will do more tests and try to see what happens with expanding virtual disks, but, as you say, that puts wnbd partly off the hook. However, the wnbd driver should still handle it without a lockup disaster, even if the nbd server is slow to catch up.

petrutlucian94 commented 2 years ago

Just wondering: have you attached the same image to a Windows host and a Linux VM simultaneously? That's likely going to corrupt the ntfs filesystem. In general, only a cluster-aware filesystem can be safely attached to multiple hosts without risking data corruption (e.g. Windows Cluster Shared Volumes, but that requires persistent SCSI reservations, which we don't support yet).

Q1) How can one unmount the drive without worrying about causing an error?

Wnbd tries to do a "soft" unmap first, which notifies the Windows storage stack that the disk is going to be removed, letting other drivers flush their data before proceeding with the unmap. By default, we'll wait 30s before doing a hard remove as a fallback. If --debug is specified, wnbd-client.exe will show which driver is preventing the disk from being unmapped.

Here are the parameters that unmap can receive:

Usage: wnbd-client unmap <instance-name>
                         [--hard-disconnect] [--no-hard-disconnect-fallback]
                         [--soft-disconnect-timeout <arg>]
                         [--soft-disconnect-retry-interval <arg>]

Remove disk mapping.

Positional arguments:
  <instance-name>                          Disk identifier.

Optional arguments:
  [--hard-disconnect]                      Perform a hard disconnect. Warning:
                                           pending IO operations as well as
                                           unflushed data will be discarded.
  [--no-hard-disconnect-fallback]          Immediately return an error if the
                                           soft disconnect fails instead of
                                           attempting a hard disconnect as
                                           fallback.
  [--soft-disconnect-timeout <arg>]        Soft disconnect timeout in seconds.
                                           The soft disconnect operation uses
                                           PnP to notify the Windows storage
                                           stack that the device is going to be
                                           disconnected. Storage drivers can
                                           block this operation if there are
                                           pending operations, unflushed caches
                                           or open handles. Default: 30.
  [--soft-disconnect-retry-interval <arg>] Soft disconnect retry interval in
                                           seconds. Default: 2.

Common arguments:
  [--debug]                                Enable debug logging.
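
For example, to give the storage stack more time and log which driver is vetoing the removal, something along these lines:

PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe unmap gkpics01 --soft-disconnect-timeout 60 --debug
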
hgkamath commented 2 years ago

Nope, I make certain that the raw image isn't simultaneously used by the nbd client and the VM. By default, qemu-nbd is set to exit when a client disconnects. (There is another option to let it remain online for a re-connection, -t / --persistent, "don't exit on the last connection", but I haven't used that.)

Yes, I know it would be very bad if two non-cooperating processes used and wrote to the image simultaneously; that is not my intent.

There could be a bug there if soft-disconnect is not happening. I don't need to do any data transfer for that to happen. Just:

  1. start qemu-nbd,
  2. map drive using wnbd
    • Check diskmgmt.msc to make sure a drive letter is allocated
  3. Unmap drive using wnbd
    • Doesn't matter if drive-letter is removed inside diskmgmt. same error either case

In my tests, there is no data error on account of the hard disconnect. I make sure to use the sync command before unmapping. But, as I said, it is a little worrying.

As an end goal, I envision having a 2TB disk with many 100GB vdisks on it, to be attached on a need basis.

  1. some vhdx vdisks to be used Windows-only; diskmgmt can attach those directly (e.g. scoop, windows-progs). Hopefully I replace all of these with qcow2 as in 2.
  2. some qcow2 vdisks may provide data-vdisks to windows on a need basis, using wnbd.
  3. most qcow2 vdisks are to be attached to the VM, mounted and used within the VM, and then perhaps samba/CIFS can be used if any data folder needs to be shared with the host OS

Ideally, I wanna do away with the different formats and go with the ONE TRUE VDISK FORMAT that works well with Windows or Linux, be it Hyper-V or kvm/qemu, be it attached to the host OS or a guest VM OS.

I hope to make my system as follows.

In the comment before last, you mentioned getting rid of qemu. Did you have another open-source server in mind for me to try on the Windows host? I thought the ceph server needs to run on a Linux host. Also, the whole cluster FS is overkill for a single-user, single-machine system.

petrutlucian94 commented 2 years ago

There could be a bug there if soft-disconnect is not happening.

Well, if there's another driver from the storage stack that's preventing us from doing a soft disconnect, there's not much we can do. We can either return an error or do a force unmap, which can be configured using the unmap parameters. Again, it might be useful to know who's holding the disk (using the --debug parameter).

In the comment before last, you mentioned getting rid of qemu. Did you have another open-source server in mind for me to try on the Windows host? I thought the ceph server needs to run on a Linux host. Also, the whole cluster FS is overkill for a single-user, single-machine system.

indeed, ceph was definitely not intended for single host environments. if qemu-nbd causes problems, you may also consider lvm + iscsi or even zfs + iscsi.

hgkamath commented 2 years ago

A dummy map and unmap cycle: as I had removed the drive letter in diskmgmt in a previous attempt, when the wnbd map command attaches the drive, diskmgmt remembers not to assign a drive letter. So on this trial run there is no need to even do sync or remove the drive letter. Nevertheless, the unmap soft removal fails. All commands are executed within a minute.

PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01 --debug
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}

PS C:\> wnbd-client.exe list
Pid         DiskNumber  Nbd    Owner            InstanceName
2936        10          true   wnbd-client      gkpics01

PS C:\> wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests        : 104
TotalSubmittedIORequests       : 104
TotalReceivedIOReplies         : 104
UnsubmittedIORequests          : 0
PendingSubmittedIORequests     : 0
AbortedSubmittedIORequests     : 0
AbortedUnsubmittedIORequests   : 0
CompletedAbortedIORequests     : 0
OutstandingIOCount             : 0

PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe unmap --instance-name gkpics01 --debug
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 5.82s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 13.63s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 21.45s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 29.31s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 37.09s
libwnbd.dll!WnbdRemoveEx WARNING Soft device removal failed. Falling back to hard removal.
petrutlucian94 commented 2 years ago

That's useful information, the veto type is set to 5 (PNP_VetoOutstandingOpen) [1], so there's probably an open handle.

I usually use Process Hacker or the sysinternals tools to identify open handles. You could also try setting the disk offline before unmapping it:

Set-Disk -Number $diskNumber -IsOffline $true

[1] https://learn.microsoft.com/en-us/windows/win32/api/cfg/ne-cfg-pnp_veto_type#constants
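
(The disk number comes from the DiskNumber column of wnbd-client.exe list, e.g.:)

PS C:\> $diskNumber = 10   # DiskNumber shown by wnbd-client.exe list for gkpics01
PS C:\> Set-Disk -Number $diskNumber -IsOffline $true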

hgkamath commented 2 years ago

I determined what the cause is.

I usually keep an admin-privileged taskmgr.exe process running for emergency troubleshooting. If taskmgr is closed, the soft removal succeeds. Why does taskmgr do that? It probably shouldn't.

The Event Viewer (eventvwr.msc) had the following info:

map:
warn 4:37:15 NbdNegotiate:296 Ignoring unsupported NBD reply info type: 3
Info 4:37:15 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.

unmap:
Warn 4:37:36 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:39 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.
Warn 4:37:45 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:45 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.
Warn 4:37:53 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:55 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.
Warn 4:38:01 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:38:01 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy.  No action is needed.
Warn 4:38:08 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Warn 4:38:08 NbdReadExact:36 Failed with : 0
Err  4:38:38 NbdReadReply:490 Could not read command reply.
Warn 4:38:38 Disk 10 has been surprise removed.

The same thing happens even if the volume/disk is set to offline using the Set-Disk command; diskmgmt shows the disk having gone offline.

Apparently, that's an M$ thing

petrutlucian94 commented 2 years ago

If taskmgr is closed, the soft removal succeeds. Why does taskmgr do that? It probably shouldn't

taskmgr provides various disk IO statistics, so I assume it holds some disk handles. That's normal behavior, nice catch though.

Warn 4:38:08 NbdReadExact:36 Failed with : 0
Err  4:38:38 NbdReadReply:490 Could not read command reply.

That message gets logged when an nbd connection is closed. I've added a TODO item to avoid logging an error since that can be confusing.

hgkamath commented 1 year ago

20230414 wnbd/qsd: documenting a stuck-on-xcopy situation for reproduction

PS C:\Windows\system32>  D:\vstorage\nbd\wnbd_client\wnbd-client.exe version
wnbd-client.exe: 0.4.1-10-g5c5239c
libwnbd.dll: 0.4.1-10-g5c5239c
wnbd.sys: 0.4.1-10-g5c5239c

Copy test

.

Event log

In the below, who is \Device\RaidPort4 that raises event ID 129? I don't even use RAID. The laptop has 3 disks: a sata-ssd, a sata-hdd, and a sata-hdd via an msata-cdrom-caddy. I later gathered that this raidport number is whatever wnbd attaches itself as, on the underside, to storport. Looking at the event log filtered for event ID 129, the source is usually wnbd. Maybe the number of times it shows up corresponds to the number of times I've let the system get stuck by this test. In the past, wnbd has raised event ID 129 as RaidPort1, RaidPort6 and RaidPort10, so it depends on what it gets assigned during driver insertion. The binary data payload seems to be the same for each event ID 129.
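
(One way to pull just those events with PowerShell, as a sketch:)

PS C:\> Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'wnbd'; Id = 129 } | Format-List TimeCreated, Message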

The below log is from a different run.

D:\test\sgdata\photos001\D0020_bf23a7e9\File_0452.bin
File creation error - A device which does not exist was specified.

04/19/2023 13:57:35
04/19/2023 14:07:43

I looked at eventvwr.msc, Windows Logs / System category.
This is the first event after an event lull period, about 9 minutes into the xcopy step.

Log Name:      System
Source:        wnbd
Date:          4/19/2023 2:06:34 PM
Event ID:      129
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      Capella-PC
Description:
The description for Event ID 129 from source wnbd cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

\Device\RaidPort4

The message resource is present but the message was not found in the message table

Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="wnbd" />
    <EventID Qualifiers="32772">129</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2023-04-19T08:36:34.8522246Z" />
    <EventRecordID>277544</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="23460" />
    <Channel>System</Channel>
    <Computer>Capella-PC</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\RaidPort4</Data>
    <Binary>0F001800010000000000000081000480010000000000000000000000000000000000000000000000000000000000000000000000810004800000000000000000</Binary>
  </EventData>
</Event>

After this, there are many disk IO errors, like so (there are 4 of these at 2:06:34):

Log Name:      System
Source:        disk
Date:          4/19/2023 2:06:34 PM
Event ID:      153
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      Capella-PC
Description:
The IO operation at logical block address 0x5f3038 for Disk 11 (PDO name: \Device\000000c0) was retried.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="disk" />
    <EventID Qualifiers="32772">153</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2023-04-19T08:36:34.8522246Z" />
    <EventRecordID>277545</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="23460" />
    <Channel>System</Channel>
    <Computer>Capella-PC</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\Harddisk11\DR34</Data>
    <Data>0x5f3038</Data>
    <Data>11</Data>
    <Data>\Device\000000c0</Data>
    <Binary>0F01040004002C0000000000990004800000000000000000000000000000000000000000000000000000022A</Binary>
  </EventData>
</Event>

This one comes after the adapter reset:

Log Name:      System
Source:        disk
Date:          4/19/2023 2:07:39 PM
Event ID:      157
Task Category: None
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      Capella-PC
Description:
Disk 11 has been surprise removed.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="disk" />
    <EventID Qualifiers="32772">157</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2023-04-19T08:37:39.6266117Z" />
    <EventRecordID>277550</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="24996" />
    <Channel>System</Channel>
    <Computer>Capella-PC</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\Harddisk11\DR34</Data>
    <Data>11</Data>
    <Binary>0000000002003000000000009D000480000000000000000000000000000000000000000000000000</Binary>
  </EventData>
</Event>

Events that come after the previous: there are many ntfs delayed-write-failed and application-popup delayed-write-failed events, which are to be expected after a disk removal.

hgkamath commented 1 year ago

Q) Does the above eventvwr wnbd event ID 129 and its binary payload provide any debugging clue?

As in the Azure link below, the wnbd interface to the network block device is an emulated device, and there is no real physical SCSI hardware to fail. Perhaps when some NBD communication times out, storport records it as a 129. Perhaps some file-system maintenance for file expansion, or something similar, kicks in and causes the delay. Note: mine is a Haswell quad-core machine on which the nbd-server and storport are colocated, so it might not be as fast as the machines you use.

Apparently, HyperV also had similar resets, which needed to be handled, as they were harmless emulated timing issues. https://support.microsoft.com/en-us/topic/many-id-129-events-are-logged-when-you-run-a-windows-server-2012-r2-virtual-machine-in-windows-azure-a9c9d4eb-7533-2704-6049-7428bbb565cb

qemu-storage-daemon does not do any disk virtualization, physical-disk emulation or bus-protocol emulation; its file/virtual-disk/block code moves data back and forth directly to the NBD export. Unlike qemu-nbd, which by default exits when the client connection drops and needs a persistence command-line argument to be told not to terminate, the qemu-storage-daemon process keeps running (persistent by default) until one presses Ctrl-C. So it should be possible to reconnect.
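For comparison, a minimal sketch of the two invocations (image path and export name are placeholders; for qemu-nbd the relevant option is, as far as I know, -t/--persistent):

PS C:\Users\admin> # qemu-nbd exits when the last client disconnects unless -t/--persistent is passed
PS C:\Users\admin> qemu-nbd.exe -t -f qcow2 -x gkpics01 -b 127.0.0.1 -p 10809 F:\gkpics01.qcow2

PS C:\Users\admin> # qemu-storage-daemon keeps serving until Ctrl-C, so the client can re-map after a reset
PS C:\Users\admin> qemu-storage-daemon.exe --blockdev driver=file,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on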

petrutlucian94 commented 1 year ago

It just means that the NBD connection hung and storport is issuing LUN resets.

I'll need to:

petrutlucian94 commented 1 year ago

I've replicated the issue using qemu-storage-daemon.exe. The NBD connection eventually hangs and we can no longer send requests or receive replies. Windows becomes extremely unresponsive when that happens.

This doesn't affect rbd-wnbd, so the potential causes are:

Thanks again for the bug report, I'm reopening the issue. Sorry for the slow progress, our main focus is currently Ceph, we're providing the NBD client functionality as a best effort.

LE: NBD socket reads and writes hang indefinitely: https://pastebin.com/raw/6rrggyDs. We'll move the NBD client to userspace as soon as possible.

LLE: there seems to be a problem with qemu-storage-daemon on Windows. Wnbd works fine with Linux qemu-storage-daemon NBD servers or the standard Linux nbd-server.

I did prepare a draft that moves the NBD implementation to userspace, making it easier to maintain: https://github.com/cloudbase/wnbd/pull/122. I'll probably prepare another PR that forwards LUN resets to the wnbd daemon so that it may reset or remove the connection instead of blocking the host.

Looks like qemu's blk_co_pwrite hangs at some point: https://github.com/qemu/qemu/blob/6972ef1440a9d685482d78672620a7482f2bd09a/nbd/server.c#L2514. This might also affect Qemu VMs on Windows.

petrutlucian94 commented 1 year ago

@hgkamath I spent some time looking at qemu-storage-daemon, some WriteFile calls were hanging. The good news is that it seems to work fine when passing cache.direct=on, for example:

qemu-storage-daemon.exe  --blockdev driver=file,node-name=file,filename=e:\img\test_100gb.raw,cache.direct=on --blockdev driver=raw,node-name=raw,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=raw,name=test_100gb,writable=on
hgkamath commented 1 year ago

I'll try that out.

One thought that I wanted to pass along:
I discovered that qemu on Windows also uses mingw-winpthreads by default.
There is a bug, which I filed myself, https://gitlab.com/qemu-project/qemu/-/issues/1605,
in which I found that the workaround is to use native threads with aio=native instead of the default mingw-winpthreads (the implicit option being aio=threads).

I have encountered that particular qemu bug only with rapid expansion of a dynamic VHDX vdisk file (vhdx-dyn). The chance that it surfaces here is slim because this test uses a fixed qcow2: a) for starters, qemu's implementation of its native qcow2 format is much more performant than VHDX, and b) being a fixed vdisk, the block allocation is already complete, so the vdisk file in this test does not have to expand.
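For reference, a fully pre-allocated ("fixed") qcow2 of the kind used in this test can be created with qemu-img's preallocation option, roughly like the sketch below (size and path are placeholders):

PS C:\Users\admin> qemu-img.exe create -f qcow2 -o preallocation=full F:\gkpics01.qcow2 93G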

Also, your comment confirms that it happens for the raw format itself. Indeed, one should test with raw: this bug is independent of qemu's vdisk format, and choosing raw removes uncertainties arising from qemu's vdisk implementations.

Nevertheless, I now test with the drive option aio=native just to reduce the chance of that.

PS C:\Users\admin> qemu-storage-daemon.exe --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on

PS C:\Windows\system32> $(dir F:\gkpics01.qcow2).Length
99739238400
PS C:\Windows\system32> fsutil sparse queryflag F:\gkpics01.qcow2
This file is NOT set as sparse

As expected, I still experienced a hang with aio=native as well.

In one of the hangs, it was possible to do an uninstall-driver, install-driver, map, and then resume/restart the xcopy without having to Ctrl-C qemu-storage-daemon, meaning that in that run the qemu-storage-daemon NBD server was not stuck and the NBD reconnection succeeded.
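Roughly, that recovery sequence looks like the sketch below; the install-driver/uninstall-driver subcommands and the .inf path are assumptions on my part and should be checked against wnbd-client's help output:

PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe uninstall-driver
PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe install-driver D:\vstorage\nbd\wnbd_client\wnbd.inf
PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01
PS C:\Windows\system32> # then resume/restart the xcopy onto the re-mapped volume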

From the qemu manual pdf, pg 41:
cache.direct: The host page cache can be avoided with cache.direct=on. This will attempt to do disk IO directly to the guest’s memory. QEMU may still perform an internal copy of the data.
For cache.direct to work on Windows, the O_DIRECT file-open flag has to be supported by the Windows/mingw/Win32 API, and the filesystem and filesystem driver have to support it as well.
The Win32 API has FILE_FLAG_NO_BUFFERING for CreateFile, ReOpenFile and file buffering in general.
FILE_FLAG_NO_BUFFERING maps to the O_DIRECT flag of the Linux open call.
So, it does: block/file-win32.c does use the flag.
In file-win32.c/file-posix.c, the function raw_parse_flags() sets the corresponding flags depending on whether BDRV_O_NOCACHE is set, which is set when the cache.direct option is provided.
... just checking.

It's a good idea to quickly detect the stuck situation and react by uninstalling the driver.
Superstition: I think that leaving the machine unattended and letting it remain in this stuck state can lead to the machine becoming slower and eventually having a more serious system crash.
Techniques I use to quickly detect a stuck situation:

  1. Usually, for small files, one can see xcopy progressing; that clearly means the system is not stuck.
  2. For larger files, xcopy takes a long time on a single file. So, keep a notepad window open and alternately press the space bar and Ctrl-S (save): while the machine is not stuck, the unsaved-changes asterisk in the title bar toggles; when the machine becomes stuck, notepad becomes unresponsive on save.
  3. IDEA-FAIL: Run the below script in powershell while the copy is in progress. It should not block on file access and should keep printing the time every 6 seconds (not that I expect the file size to change). A refined watchdog sketch follows right after this list.
    PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1}" -f $(date),$(dir F:\gkpics01.qcow2).Length ; sleep 6 ; $i = $i + 1 }
    The above turned out to be ineffective, so I tried the below:
    PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1} {2}" -f $(date),$(dir F:\gkpics01.qcow2).Length,$(dir F:\gkpics01.qcow2).LastWriteTime[0].ToString() ; "test" | Out-File -FilePath K:\tmpq\Downloads\a.txt ; sleep 6 ; $i = $i + 1 } But with this, I could not reproduce the stuck-state; it was acting like a keep-alive.
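A possible refinement of idea 3, as a sketch (paths and timeouts are assumptions): run the probe write inside a background job and flag a hang when the job does not finish in time. The probe targets the wnbd-mapped volume (J:) so that it actually blocks during a drive-lockup, though, like the Out-File variant above, it may also act as a keep-alive.

    $probe = "J:\wnbd_probe.txt"
    for ($i = 0; $i -lt 400; $i++) {
        # start the write in a background job so that the console itself never blocks
        $job = Start-Job { param($p) Set-Content -Path $p -Value (Get-Date).Ticks } -ArgumentList $probe
        if (Wait-Job $job -Timeout 10) {
            "{0} probe ok" -f $(date)
            Remove-Job $job
        } else {
            # a blocked job may never finish; just report it and keep polling
            "{0} probe still blocked after 10s - possible drive lockup" -f $(date)
        }
        Start-Sleep -Seconds 6
    }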

Another superstition that I am still watching out for: I think that when the stuck-state happens, it is most likely at the really large files. It is not always the same file, but I will need to see whether it is always one of the large ones listed below.
EDIT: This is maybe just that, a superstition. I started noting the size of the file on which the run ended, and neither it nor the next file was always large; sometimes the files were as small as 3 MiB.
>512 MiB

D:\test\sgdata>forfiles /S /M * /C "cmd /c if @fsize GEQ 536870912 echo @path @fsize"
"D:\test\sgdata\photos001\D0011_325a7df4\D0004_ff4697dc\File_0000.bin" 870548550
"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0000.bin" 917280712
"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0001.bin" 1193905888
"D:\test\sgdata\photos001\D0014_127c4d62\File_0000.bin" 4294694852
"D:\test\sgdata\photos001\D0015_a42d4363\File_0000.bin" 2381007674
"D:\test\sgdata\photos001\D0015_a42d4363\File_0001.bin" 2103925866
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0012.bin" 1223441112
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0014.bin" 1498293582
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0019.bin" 800657438
"D:\test\sgdata\photos001\D0017_6986e32f\File_0000.bin" 1889579172
"D:\test\sgdata\photos001\D0017_6986e32f\File_0001.bin" 1978039382
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1824.bin" 885577878
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1849.bin" 758393712
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1861.bin" 622607906
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1867.bin" 559405836
"D:\test\sgdata\photos001\D0021_1ab60b0c\File_1027.bin" 790637902
"D:\test\sgdata\photos001\D0021_1ab60b0c\File_1030.bin" 859589186
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0030.bin" 586790162
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0032.bin" 710406956
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0034.bin" 632330336
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0036.bin" 662133078
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0040.bin" 1157315322
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0042.bin" 1553586422
"D:\test\sgdata\photos001\D0025_7bd77bae\File_0110.bin" 785962104
"D:\test\sgdata\photos001\D0025_7bd77bae\File_0112.bin" 689221632
"D:\test\sgdata\photos001\D0026_409200f8\File_0034.bin" 797661226
"D:\test\sgdata\photos001\D0026_409200f8\File_0036.bin" 1151132800
"D:\test\sgdata\photos001\D0030_2a29839c\File_0516.bin" 835986114
"D:\test\sgdata\photos001\D0031_b20f40a7\File_0045.bin" 828452346

>1 GiB

D:\test\sgdata>forfiles /S /M * /C "cmd /c if @fsize GEQ 1073741824 echo @path @fsize"

"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0001.bin" 1193905888
"D:\test\sgdata\photos001\D0014_127c4d62\File_0000.bin" 4294694852
"D:\test\sgdata\photos001\D0015_a42d4363\File_0000.bin" 2381007674
"D:\test\sgdata\photos001\D0015_a42d4363\File_0001.bin" 2103925866
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0012.bin" 1223441112
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0014.bin" 1498293582
"D:\test\sgdata\photos001\D0017_6986e32f\File_0000.bin" 1889579172
"D:\test\sgdata\photos001\D0017_6986e32f\File_0001.bin" 1978039382
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0040.bin" 1157315322
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0042.bin" 1553586422
"D:\test\sgdata\photos001\D0026_409200f8\File_0036.bin" 1151132800
hgkamath commented 1 year ago

Confirming that the workaround cache.direct=on gave me a successful run.
Will do future tests both with and without it.

I theorize that removing the OS/host caching from the picture probably improves the responsiveness (the time taken to respond to wnbd-client), and hence makes it less likely for the protocol to trip and fall into the stuck-state.
Probably the Windows OS puts processes to sleep while data is moving into and out of the cache.
The question becomes: is it Windows tripping on caching, qemu tripping on caching, or the wnbd client tripping due to the delay?
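One way to narrow that down at the moment of a hang might be to check whether requests are piling up on the wnbd side or are being answered slowly by the server, e.g. with the stats subcommand (mapping name as used in this test):

PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe stats gkpics01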

PS C:\Windows\system32> qemu-storage-daemon.exe  --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2,cache.direct=on --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on

PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)"  }
:
15377 File(s) copied
05/18/2023 17:33:06
05/18/2023 17:50:19

# unmap remap
PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe unmap gkpics01
libwnbd.dll!PnpRemoveDevice WARNING Could not remove device. CM status: 23, veto type 6, veto name: STORAGE\Volume\{50a6c206-f464-11ed-9807-3464a9d34f26}#0000000001000000. Time elapsed: 1.87s, time left: 28s. Check the Windows System event log (Kernel-PnP source) for more information about the process that is holding the disk.
PS C:\Windows\system32>  D:\vstorage\nbd\wnbd_client\wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01
PS C:\Windows\system32>

PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)"  }
05/18/2023 17:50:49
05/18/2023 17:50:49
05/18/2023 18:06:20
petrutlucian94 commented 1 year ago

Thanks for the confirmation. It doesn't seem related to NBD or the wnbd client. Qemu's WriteFile calls [1] are hanging unless we disable caching. I didn't dig too much into qemu's caching-related code, so I'm not sure whether, apart from FILE_FLAG_NO_BUFFERING [2], there are any other changes when passing cache.direct=on.

I wouldn't completely exclude some Windows overlapped IO weirdness, we had some similar issues [3].

[1] https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L86 [2] https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L281 [3] https://github.com/cloudbase/wnbd/pull/30

hgkamath commented 1 year ago

It feels like we're getting somewhere. Many thanks for your effort.
Most of what I write here is probably just verbiage added to what you wrote succinctly.
(Also note that I am a user, not a developer, so the terms I use might not be correct.)

In the qemu project, on the Windows side of things, there are only maintainers and no developers. Windows bugs get fixed when the fix is a happy side-effect of something else they were doing, usually on Linux, or when Windows users who have done the troubleshooting nail something down and point out what exactly has to be done. There aren't that many advanced-level Windows users.

Just to remember: when the drive-lockup/system-stuck state happens, only already-open windows/applications respond, and only until they are forced to do some disk access. If an application attempts to do something with the disk, it freezes and becomes unresponsive, e.g. notepad, loading a powershell command, etc.

Either one of the following is true:
1) The WriteFile() hang is just yet another symptom of the drive-lockup/system-stuck state (the chain of events leading to the drive-lockup being caused elsewhere). This would be a possibility if, as you said, the cause could be pinned to something cache.direct=on does elsewhere, apart from setting FILE_FLAG_NO_BUFFERING, wherever else BDRV_O_NOCACHE has influence ... and the overlapped-IO weirdness.
2) Or the WriteFile(), in conjunction with other asynchronous read/write calls and due to some combination of faulty flags and structure parameters, is causing the block/deadlock, and everything is stuck because of it.
3) Or the WriteFile() is actively part of a cyclic chain of components that are in deadlock (a cyclic deadlock).

If it is a cyclic deadlock

Maybe,

.

PS C:\Windows\system32> qemu-storage-daemon.exe --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on

PS C:\Windows\system32> xcopy /?
:
 /J           Copies using unbuffered I/O. Recommended for very large files.
:

PS C:\Windows\system32> $sdate=$(date) ; echo "$sdate" ; xcopy /J /E D:\test\sgdata\photos001 J:\photos001 ; echo "$sdate" ; $edate=$(date) ; echo "$edate"

15377 File(s) copied
05/19/2023 15:00:32
05/19/2023 15:18:43

PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)" }

05/19/2023 15:23:59
05/19/2023 15:23:59
05/19/2023 15:39:13

Unbuffered write

It's also possible that if the nbd-client part of wnbd were jettisoned into userspace, the OS/kernel would not insist that wnbd respond immediately or within a short time bound, allowing for a slower or delayed response from the emulated drive.

Q) Another question, to learn from you: requesting you to document what you did and what tools you used. As seen in bug qemu-#1605, so far I am able to compile my own qemu-8.0 with debug-info symbols using mingw and do some very basic troubleshooting using gdb, like putting in assert()/printf() statements and inspecting stack-frame variables.
How did you troubleshoot this? How did you know qemu was stuck in WriteFile()?

petrutlucian94 commented 1 year ago

That's a very interesting observation about the cache dependency. I'd be curious to see what happens when connecting to qemu-storage-daemon from another host/vm (either Linux or Windows), thus avoiding this dependency.

So the idea is that WNBD disk writes are cached and so are the underlying qemu image file writes. This might lead to a deadlock when Windows attempts to flush its cache. I'm not really familiar with the Windows cache manager internals, the guys from OSR (https://community.osr.com) might be able to provide more insight (some Microsoft engineers used to post there). However, it's highly likely that writes are blocked while there's a pending cache flush, despite targeting a different disk.

If this situation is confirmed, we could document something along those lines:

When using WNBD to connect to a local NBD server, make sure to disable
caching either at the WNBD disk level or at the NBD server level. Windows cache
flushes may otherwise deadlock.

For example, when using ``qemu-storage-daemon.exe`` locally, make sure to pass
``cache.direct=on`` like so:

    qemu-storage-daemon.exe  \
      --blockdev driver=file,node-name=file,filename=e:\img\test_100gb.raw,cache.direct=on \
      --blockdev driver=raw,node-name=raw,file=file \
      --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 \
      --export type=nbd,id=export,node-name=raw,name=test_100gb,writable=on

It's also possible that if the nbd-client part of wnbd were jettisoned into userspace, the OS/kernel would not insist that wnbd respond immediately or within a short time bound, allowing for a slower or delayed response from the emulated drive.

I've been testing using the userspace nbd client proposed here: https://github.com/cloudbase/wnbd/pull/122. The behavior is the same.

Q) Another question, to learn from you: requesting you to document what you did and what tools you used. As seen in bug qemu-#1605, so far I am able to compile my own qemu-8.0 with debug-info symbols using mingw and do some very basic troubleshooting using gdb, like putting in assert()/printf() statements and inspecting stack-frame variables. How did you troubleshoot this? How did you know qemu was stuck in WriteFile()?

I did mostly the same. I was initially considering using Ubuntu to cross compile Qemu for Windows, however it was easier to go with msys2 since it already had all the dependencies (e.g. glib).

Then I checked the nbd server code, looking for the main entry points for NBD IO operations. I was a bit familiar with the qemu block drivers since I wrote a few patches for vhd(x) about 10 years ago :). I knew that raw image file writes should eventually hit this: https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L86.

I added a few log messages to track IO requests more easily: https://github.com/petrutlucian94/qemu/commit/b18da91174c73cebec8c1a2e8a13726f7c6687c4. Probably I could've just used qemu tracing, however I haven't used that in a while and it was easier to just add a few print messages :). I was running wnbd in debug mode and checking the last 10 requests or so before the deadlock occurred. I was mostly ensuring that the write address, length and nbd tag match the ones received by the nbd server.

Sure, a stack trace would've also told me about the WriteFile hang, however I was also looking for unusual IO patterns. When building qemu using mingw-gcc, the debug symbols use dwarf format, which can't be used by WinDBG. I usually use this gdb fork for opening Windows dumps (the upstream version doesn't support this feature): https://github.com/ssbssa/gdb/releases/tag/gdb-13.1.90.20230220.

Another option might be to compile qemu using mingw-llvm, which has .pdb symbol support, which can be handled by WinDBG.
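For completeness, a minimal interactive variant of the same check, assuming an msys2/mingw gdb and a qemu-storage-daemon.exe built with debug info (the PID is a placeholder): attach to the hung process, dump all thread backtraces and look for threads parked in WriteFile or overlapped-IO waits.

PS C:\Users\admin> gdb.exe -p 12345
(gdb) info threads
(gdb) thread apply all bt
(gdb) detach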

hgkamath commented 1 year ago

In comment 1551603847, I had this idea of using a script to detect a drive-lock quickly and visually, without manual checking.

  1. PARTIAL-IDEA-FAIL: While xcopy is in progress, run the below script in powershell. It should not block on file access and should keep printing the time every 6 seconds (not that I expect the file size to change).
    PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1}" -f $(date),$(dir F:\gkpics01.qcow2).Length ; sleep 6 ; $i = $i + 1 }
    The above turned out to be ineffective; it probably got the file size from RAM/fs-metadata cache. So I tried the below:
    PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1} {2}" -f $(date),$(dir F:\gkpics01.qcow2).Length,$(dir F:\gkpics01.qcow2).LastWriteTime[0].ToString() ; "test" | Out-File -FilePath K:\tmpq\Downloads\a.txt ; sleep 6 ; $i = $i + 1 } But with this, I could not reproduce the stuck-state; it was acting like a keep-alive.
    [EDIT] With file buffering disabled neither in xcopy nor in qsd (i.e. without the workarounds), I did get the stuck-state to happen at least once in this way.

It feels like the stuck-state is less likely to happen (it shows up much later into the run) when I just stay glued to the same desktop screen with the powershell windows and do nothing but wait for xcopy or notepad to become stuck, hoping to see a stuck script as well; either the stuck-state happens but the script does not stop, or the stuck-state never happens. It feels like the stuck-state is more likely to happen (it shows up sooner into the run) when I have multi-tasked to a Firefox window on a different desktop screen (the win10 multiple-desktops feature), which is what I did most of the time. After a while, as one browses, the drive-lock happens: when a new weblink/tab is opened, the Firefox window freezes/becomes unresponsive. Perhaps using something like Firefox creates contention for the file-buffer cache. One might take longer to notice the stuck state if one only context-switches between browser tabs whose page artifacts Firefox already has in memory. It is a good thing that even in the stuck-state/drive-lockup, when even things like the start menu and taskbar might freeze, one can still switch between desktops using the Ctrl-Win-Left and Ctrl-Win-Right keyboard shortcuts. (Rapid switching between desktop screens does not increase the chances of the bug.)

So file-activity on the host may have an influence on this bug.

These hunches I get could be superstitions.

That something external to both qsd and wnbd has this effect makes me think this could be a bug involving file buffering inside the Win10 kernel itself.

writes are blocked while there's a pending cache flush, despite targeting a different disk

A cache flush of files on the top-layer storage (J:) becomes block writes into a vdisk file which are also cached, the vdisk file being located on a lower layer (F:). The Windows cache manager manages caching per file object (ref: learn.ms/file-caching). Maybe the file-buffer cache for F: holds a small number of blocks of the single large vdisk file (93 GiB), perhaps the blocks from the last few WriteFile()/ReadFile() accesses.
The qsd workaround works because, when the J: files are being flushed, the qemu WriteFile() to F: no longer requires interception and a decision by the Windows cache manager. The xcopy /J workaround works because there is no J:-layer file-buffer cache to flush that is intense enough to cause the F: writes to block.

I have not yet had a test run where the workarounds (cache.direct=on or xcopy /J) did not work; that's good. Though the reproducibility without the workarounds is probabilistic, the reason I may have had more certainty of running into it is that I used to let this run in the background while doing other things, including browsing with Firefox. In the 85 GiB copy, most stucks happen within the first 13 GiB copied (about 5 minutes); I get a bit antsy if xcopy passes that mark. But I've also had a test run wherein the stuck-state happened near the end of the xcopy.

In the event that a run succeeds, one needs to repeat the test a few times in order to get a statistical feel, so as to differentiate between the following cases.
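For repeating the test, a rough loop like the sketch below might help; the paths are copied from the commands above, the /I flag and the per-run cleanup step are assumptions, and /J or cache.direct=on can be toggled depending on which configuration is being exercised.

1..5 | ForEach-Object {
    "run $_ start: $(date)"
    xcopy /E /I /Q D:\test\sgdata\photos001 J:\photos001
    cd J:\photos001
    C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c .\find.CHECKSUM --quiet
    cd C:\
    Remove-Item -Recurse -Force J:\photos001   # free space on the wnbd volume before the next run
    "run $_ end: $(date)"
}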

petrutlucian94 commented 1 year ago

I had a simple loop that was just copying files and it used to hang after about 5 minutes. After disabling the cache on the nbd server side, it transferred a few TBs overnight without problems.

The best thing we can do right now is to document this, I'll send a patch. Another option would be to forward LUN resets to the nbd daemon, however that would still be less than ideal: having IO stuck every 5 minutes for 30s (under load), until Storport issues a reset.

By the way, I've merged the PR that moves the NBD daemon to userspace (libwnbd): https://github.com/cloudbase/wnbd/pull/122.

LE: I've documented it here: https://github.com/cloudbase/wnbd/pull/123.

petrutlucian94 commented 1 year ago

After documenting the NBD caching issue and implementing stale connection detection (https://github.com/cloudbase/wnbd/pull/126), I think we can close this issue.

hgkamath commented 1 year ago

Yes, go ahead.

This is the corresponding issue on the qemu side, just for reference:
https://gitlab.com/qemu-project/qemu/-/issues/814
It contains only occasional summaries, written as things cleared up.
I may need to update its description, as many confounding issues have been cleared up since the time of first filing.

In a direct email communication that I had with the qemu block-code developers, some points came up: