Hi,
Thank you for your bug report. From the link that you have provided, I understand that you've tested multiple nbd clients (wnbd as well as the Linux native nbd client), multiple qemu-nbd versions and image formats.
Since you're hitting the same issue with the Linux nbd client, I doubt this is a wnbd bug. I'd suggest narrowing down the problem, even taking nbd out of the picture completely. In your Qemu bug report, you're stating that some image formats are affected while some are not. You can verify this by directly attaching the images to Qemu vms, which would exercise the same Qemu block drivers, without actually using NBD.
My first assumption was that there may be unflushed changes, but since you're unmounting the volumes before closing the nbd connection, that shouldn't be the case. It may be interesting to do a wnbd-client stats $mapping before removing the nbd connection though.
Q) Is there a recommended version of qemu-nbd that is known to work reliably with WNBD, where the two together have been well tested?
wnbd provides a generic NBD client, the qemu-nbd version shouldn't really matter. In case of a protocol incompatibility, the connection would be dropped and the log would contain some meaningful error.
By the way, WNBD-6.12.39.253 isn't the actual wnbd version; you may use wnbd-client version to retrieve it.
Regards, Lucian
Thank you for your reply.
Trying the copy from inside a qemu VM is a good suggestion. qemu-system-x86_64 is very flaky under WHPX (the Windows Hypervisor Platform / Hyper-V); perhaps I should consider turning it off. The qemu VM is more reliable with the Intel HAXM hypervisor. qemu tools like qemu-img and qemu-nbd are smaller independent binaries that do not engage qemu's virtualization/emulation, just the qemu block framework. I will try it some time and see if corruption happens that way.
In the Windows qemu-setup builds (https://qemu.weilnetz.de/), the qemu-nbd.exe binary started being built and shipped with the installer from 2021-05-05, qemu version 6.0 (qemu-w64-setup-20210505.exe). So this Windows qemu nbd-server is a relatively recent addition to the set of available nbd peers/servers/clients.
qemu-nbd here is a local nbd peer running on the same machine. Both the SSD and HDD are connected to mini-SATA, hence read/write speeds are much higher than what one may get over the network. resmon.exe claims 70 million bytes per second through the qemu-nbd.exe process. Ordinary network speeds are often lower than native bus speeds; even with nominally 1 Gbps Ethernet, consumer network traffic seldom reaches the speed of the system PCI Express bus. At this native machine speed, there are bound to be bottlenecks and buffer fill/overfill.
Description of what is meant by the machine being stuck:
When the stuck happens, it is lucky that some open GUI processes work/respond at all. I keep a few extra privileged cmd.exe and pwsh.exe windows open. Any other process that makes any disk access goes into a presumably IO-blocked "not responding" state. It is not possible to invoke new programs. Hence, one must be very careful with the processes that do respond. As long as no disk access is made, processes such as resmon.exe and taskmgr.exe continue to refresh and display. GUI components like menus and selections work. Text in notepad.exe can be edited, but if an attempt is made to save it, notepad goes into not-responding. CPU usage is cool; typing, menu GUI and mouse movements are normal. Those on-screen applications that work are responsive.
The only programs that keep working are the cmd/pwsh windows. Attempting to change directory with cd will also make the command shell become stuck. It does not matter which hard disk is being accessed: SSD1, HDD1, HDD2 etc.
I think the reason why wnbd-client.exe can still be run from the cmd.exe prompt is that its dll/exe images are memory cached, so the shell is not disk-IO blocked. However, other than stats and uninstall-driver, other commands take a very, very long time to get a response from the wnbd driver.
qemu, qcow2 and the kernel nbd client are well tested and work rock-solidly under Linux. This is the other main reason why this issue is most likely a separate bug (bug2-63): its effect is seen in a different pathway and affects any virtual-disk format, including qcow2. In the original bug (bug1-727), only the VHDX virtual-disk format is affected; everything else is fine.
Uninstalling and reinstalling the driver is the only way to unstick the situation. At this point, all not-responding processes/apps come back alive. Until I discovered this, I was force shutting down the laptop. The following do not work: Ctrl-C or taskmgr end-task on qemu-nbd or xcopy, and attempting wnbd-client unmap. Taskmgr will accept the end-task the first time on the IO-blocked not-responding process, but will be unable to truly terminate it. If taskmgr end-task is attempted again on a previously end-tasked process, an "action not allowed" popup dialog is shown.
E:\vstorage\nbd\wnbd_client>wnbd-client.exe uninstall-driver
libwnbd.dll!RemoveAllWnbdDisks INFO Hard removing WNBD disk: gkpics01
libwnbd.dll!RemoveWnbdAdapterDevice INFO Removing WNBD adapter device. Hardware id: root\wnbd. Class GUID: {4D36E97B-E325-11CE-BFC1-08002BE10318}
libwnbd.dll!CleanDrivers INFO Removing WNBD driver: oem30.inf
E:\vstorage\nbd\wnbd_client>wnbd-client.exe install-driver E:\vstorage\nbd\wnbd_driver\wnbd.inf
libwnbd.dll!WnbdInstallDriver INFO Cleaned up stale WNBD adapters.
libwnbd.dll!InstallDriver INFO Installing WNBD driver: E:\vstorage\nbd\wnbd_driver\wnbd.inf
libwnbd.dll!CreateWnbdAdapter INFO Creating WNBD adapter. Class: SCSIAdapter. Hardware id: root\wnbd.
This also hints to me that it is the wnbd driver that is freezing the disk accesses of the system. The desktop is logged in as an unprivileged, non-administrative user. qemu-nbd.exe is an unprivileged process started/run in an unprivileged shell. Under normal use, while running, it is easily stopped with Ctrl-C or taskmgr. It can't really do anything by itself to bring the whole system down.
Summary of why bug2-63 is separate from bug1-727 and why it might be WNBD and not qemu-nbd.exe
Miscellaneous troubleshooting logs
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe list-opt
LogLevel : 1 (Default: 1)
NewMappingsAllowed : true (Default: true)
EtwLoggingEnabled : true (Default: true)
WppLoggingEnabled : false (Default: false)
DbgPrintEnabled : true (Default: true)
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe set-opt LogLevel 4
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe list-opt
LogLevel : 4 (Default: 1)
NewMappingsAllowed : true (Default: true)
EtwLoggingEnabled : true (Default: true)
WppLoggingEnabled : false (Default: false)
DbgPrintEnabled : true (Default: true)
PS E:\vstorage\nbd\wnbd_client> logman delete "WNBD_tracing_session"
The command completed successfully.
PS E:\vstorage\nbd\wnbd_client> logman create trace "WNBD_tracing_session" -p "{E35EAF83-0F07-418A-907C-141CD200F252}" "0xffffffff" "0xff" -o e:\tmpq\TraceFile_2.etl -rt
Still trying to capture some logs.
Miscellaneous troubleshooting
qemu-w64-setup-20210505.exe (6.0.0)
qemu-w64-setup-20210619.exe (6.0.0) <- i used this
qemu-w64-setup-20211215.exe (6.2.0) <- earlier I had tried this
Below are the files one needs; one can just extract these exe/dlls from the qemu-setup and ceph-pacific setup without doing a full install (see the extraction sketch after the listings below).
PS E:\vstorage\nbd\qemu-nbd-6.0> (Get-ChildItem).Name
iconv.dll
libbz2-1.dll
libcurl-4.dll
libffi-6.dll
libglib-2.0-0.dll
libgmp-10.dll
libgnutls-30.dll
libhogweed-4.dll
libidn2-0.dll
libintl-8.dll
libnettle-6.dll
libnghttp2-14.dll
libp11-kit-0.dll
libpcre-1.dll
libssp-0.dll
libtasn1-6.dll
libunistring-2.dll
libwinpthread-1.dll
libzstd-1.dll
qemu-nbd.exe
zlib1.dll
PS E:\vstorage\nbd\wnbd_client> (Get-ChildItem).Name
libwnbd.dll
wnbd-client.exe
PS E:\vstorage\nbd\wnbd_driver> (Get-ChildItem).Name
wnbd.cat
wnbd.inf
wnbd.sys
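For reference, one way to pull just the qemu files out of the installer without running it might be 7-Zip, which can typically open NSIS-based setup executables (a sketch, not verified here; the paths inside the installer may differ, so adjust the filters accordingly):
7z x qemu-w64-setup-20210619.exe -oE:\vstorage\nbd\qemu-nbd-6.0 qemu-nbd.exe *.dll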
Below are some wnbd-client stats captures. Usually TotalReceivedIORequests, TotalSubmittedIORequests and TotalReceivedIOReplies match and the pending/outstanding counters stay at zero, but the numbers start diverging after the stuck happens. Thereafter, collecting the stats response with wnbd-client is no longer quick but takes quite some time.
PS E:\vstorage\nbd\wnbd_client> ./wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests : 4645
TotalSubmittedIORequests : 4645
TotalReceivedIOReplies : 4645
UnsubmittedIORequests : 0
PendingSubmittedIORequests : 0
AbortedSubmittedIORequests : 0
AbortedUnsubmittedIORequests : 0
CompletedAbortedIORequests : 0
OutstandingIOCount : 0
When the stuck happens:
Just when the stuck has kicked in
E:\vstorage\nbd\wnbd_client>wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests : 8689
TotalSubmittedIORequests : 8689
TotalReceivedIOReplies : 8677
UnsubmittedIORequests : 0
PendingSubmittedIORequests : 12
AbortedSubmittedIORequests : 6
AbortedUnsubmittedIORequests : 0
CompletedAbortedIORequests : 0
OutstandingIOCount : 6
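Reading these counters (my interpretation, not an authoritative description of their semantics): 8,677 replies + 12 pending submitted = 8,689 received/submitted, so 12 requests are still waiting on the nbd server, and the AbortedSubmittedIORequests value of 6 suggests some of those have already been cancelled by the storage stack.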
The file below (it contains random bytes) is where the xcopy of the synthetically generated data set was when the system became stuck.
PS C:\Users\gana> pwsh -Command { $sdate=$(date) ; echo "$sdate" ; xcopy /E E:\test\sgdata K:\ ; echo "$sdate" ; date }
12/18/2021 15:41:18
E:\test\sgdata\photos001\find.CHECKSUM
:
E:\test\sgdata\photos001\D0000_88114831\File_0000.bin
E:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0000.bin
E:\vstorage\nbd\wnbd_client>dir H:
Volume in drive H is CPERF0
Volume Serial Number is F196-DB9E
Directory of H:\
12/18/2021 03:44 PM 6,194,987,008 gkpics01.vhdx
1 File(s) 6,194,987,008 bytes
0 Dir(s) 149,989,228,544 bytes free
E:\vstorage\nbd\wnbd_client>wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests : 8706
TotalSubmittedIORequests : 8695
TotalReceivedIOReplies : 8677
UnsubmittedIORequests : 11
PendingSubmittedIORequests : 18
AbortedSubmittedIORequests : 19
AbortedUnsubmittedIORequests : 9
CompletedAbortedIORequests : 0
OutstandingIOCount : 1
E:\vstorage\nbd\wnbd_client>wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests : 8841
TotalSubmittedIORequests : 8695
TotalReceivedIOReplies : 8677
UnsubmittedIORequests : 146
PendingSubmittedIORequests : 18
AbortedSubmittedIORequests : 19
AbortedUnsubmittedIORequests : 144
CompletedAbortedIORequests : 0
OutstandingIOCount : 1
In the test below I use a qemu VM to do the copy from inside the VM.
Preparation
qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 99723771904
After booting Alpine Linux 3.15
setup-alpine
: us, us, localhost, eth0, dhcp, no-manual-nw, timezone, noproxy, select repository, openssh, nodrives, nocfg, default apkcache
apk add gzip rsync e2fsprogs sgdisk gptfdisk
mkdir /mnt/a /mnt/b
mount -t ntfs3 /dev/sdb2 /mnt/b
mount -t ntfs3 /dev/sda2 /mnt/a
mkfs -t ext4 -L fs_gkpics_01 /dev/sda1
mount -t ext4 /dev/sda1 /mnt/a
VHDX TEST
In cmd prompt:
C:\Windows> copy E:\vstorage\gkpics01_new.vhdx H:\gkpics01.vhdx
C:\Users\gana>dir E:\vstorage\gkpics01_new.vhdx
Volume in drive E is 15KJ_Images
Volume Serial Number is 1EA6-2771
Directory of E:\vstorage
11/12/2021 01:21 PM 205,520,896 gkpics01_new.vhdx
1 File(s) 205,520,896 bytes
0 Dir(s) 360,436,940,800 bytes free
PS E:\vstorage\nbd\qemu-nbd-6.0> & C:\vol\scoop_01\scoopg\shims\qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot d -cdrom E:\transcend\alpine-standard-3.15.0-x86_64.iso -hda H:\gkpics01.vhdx -hdb E:\test\sgdata.vhdx -display gtk -vga virtio -rtc base=utc -net user -device "virtio-net,netdev=vmnic" -netdev "user,id=vmnic,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15"
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: GdkPixbuf: Cannot open pixbuf loader module file 'C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache': No such file or directory
This likely means that your installation is broken.
Try running the command
gdk-pixbuf-query-loaders > C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache
to make things work again for the time being.
Windows Hypervisor Platform accelerator is operational
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: hub 0 with no nics
Inside VM, prepare machine as mentioned earlier, setup-alpine .. mount partitions
localhost:/mnt# (sdate=`date` ; cd /mnt/b ; rsync -avH ./photos001 /mnt/a ; echo "$sdate" ; date )
localhost:/mnt# ls -l /mnt/a/photos001
The VHDX fails badly; even an ls -l did not succeed. A reboot was not required. Nothing surprising here.
QCOW2 TEST
E:\vstorage\nbd\wnbd_client>C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 99723771904
Formatting 'H:\gkpics01.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=99723771904 lazy_refcounts=off refcount_bits=16
E:\vstorage\nbd\wnbd_client>dir H:\gkpics01.qcow2
Volume in drive H is CPERF0
Volume Serial Number is F196-DB9E
Directory of H:\
12/19/2021 07:52 PM 198,144 gkpics01.qcow2
1 File(s) 198,144 bytes
0 Dir(s) 156,183,953,408 bytes free
PS E:\vstorage\nbd\qemu-nbd-6.0> & C:\vol\scoop_01\scoopg\shims\qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot d -cdrom E:\transcend\alpine-standard-3.15.0-x86_64.iso -hda H:\gkpics01.qcow2 -hdb E:\test\sgdata.vhdx -display gtk -vga virtio -rtc base=utc -net user -device "virtio-net,netdev=vmnic" -netdev "user,id=vmnic,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15"
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: GdkPixbuf: Cannot open pixbuf loader module file 'C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache': No such file or directory
This likely means that your installation is broken.
Try running the command
gdk-pixbuf-query-loaders > C:\vol\scoop_01\scoopg\apps\qemu\current\lib\gdk-pixbuf-2.0\2.10.0\loaders.cache
to make things work again for the time being.
Windows Hypervisor Platform accelerator is operational
C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-system-x86_64.exe: warning: hub 0 with no nics
Inside VM, prepare machine as mentioned earlier, setup-alpine .. mount partitions rsync copy
localhost:/mnt# (sdate=`date` ; cd /mnt/b ; rsync -avH ./photos001 /mnt/a ; echo "$sdate" ; date )
:
Sent 90,963,665,782 bytes received 298,855 bytes 60,622,431.61 bytes/sec
total size is 90,940,594,995 speedup is 1.00
Sun Dec 19 20:19:26 IST 2021
Sun Dec 19 20:44:26 IST 2021
Verifying source drive
localhost:/mnt# (sdate=`date` ; cd /mnt/b/photos001 ; sha256sum -c ./find.CHECKSUM | tee /tmp/fb ; echo "$sdate" ; date )
:
Sun Dec 19 21:28:09 IST 2021
Sun Dec 19 21:48:35 IST 2021
localhost:/mnt# cat /tmp/fb | grep -v ": OK" | wc -l
0
Verifying the destination drive containing the rsynced/copied sgdata
localhost:/mnt# (sdate=`date` ; cd /mnt/a/photos001 ; sha256sum -c ./find.CHECKSUM | tee /tmp/fa ; echo "$sdate" ; date )
:
/D0039_1923a8ec/File_0002.bin: OK
Sun Dec 19 20:47:53 IST 2021
Sun Dec 19 21:07:39 IST 2021
localhost:/mnt# cat /tmp/fa | grep -v ": OK" | wc -l
202
!! QCOW2 is corrupted
It is true that there are bugs in both the VHDX and qcow2 qemu virtual-disk code pathways, which show up even without the nbd code pathway being involved.
However, it is also true that no system-stuck lockup happened.
The system lockup is the doing of the wnbd driver.
That is why I think wnbd itself also has a bug.
The nbd protocol and the wnbd kernel driver need to be robust/resilient towards failures in the nbd peer. So if the nbd peer is screwing up, at worst it should disconnect the drive backend connection, refuse service, or try re-transmission; and if it is re-transmission, it shouldn't get stuck in an infinite loop, locking up the whole system.
The system drive-IO lockup is reminiscent of situations like bad hard-drive bad-sector errors or cdrom read errors, wherein the Windows system gets stuck retrying without ever succeeding, or takes a long time to move on. I don't think my SSD drive really has problem blocks; data integrity is fine when large virtual disks are copied onto and off it. So because qemu-nbd is screwing up, wnbd is locking up.
Summary posted to bug1-727 comment https://gitlab.com/qemu-project/qemu/-/issues/727#note_789170496
So in your case, some IO operations are hanging on the nbd server side. Windows (storport.sys to be precise) will cancel pending IO requests after about 30s, in which case that AbortedSubmittedIORequests
counter will be increased.
It's true that the Windows explorer usually hangs if the drives become unresponsive, but that's not WNBD's fault. It's the same with directly attached disk drives or SMB shares. WNBD simply receives IO requests from Storport and forwards them to an NBD server (or to a registered userspace app that uses libwnbd, but that's a different topic).
Storport miniport drivers, such as WNBD, are not supposed to retry canceled operations. At best, we may try to detect IO deadlocks and disconnect the disk, but that should be configurable.
A few other tests that you could make:
I'm going to close this issue since it's not a WNBD driver bug, but rather a qemu-nbd or environment issue.
Sorry for not updating. This is on my mind, but I didn't yet get to do it. I'm still yet to test qemu-7, which was recently released. The qemu project also released a new binary, qemu-storage-daemon.exe, which can also act as an nbd server, amongst other things.
Just some updates. The kernel 5.15.x series up to 5.15.24 had ntfs3 read-side bugs, and up to the 5.17.x series there were folio/mmotm feature commits, so I had taken a pause until some other qemu/kernel bugs got sorted out, so that there is less ambiguity and compounding. I don't think the wnbd driver is in the clear yet. But it is true that some of the negative test results above, which I did after taking your very good suggestion of testing wholly inside a qemu VM, were on account of these bugs.
I'll reopen this when I get to test it and it seems more certain that it isn't due to a qemu/linux-guest-kernel/linux-host-kernel issue.
Also, on a side note, I wish cloudbase released a regular-signed wnbd driver, which you should if you think it is bug free. On my laptop, in order to use the test-signed driver, I need to use the one-time disable-driver-signature-enforcement special boot, or use testsigning mode, for which the HP UEFI requires me to disable secure boot. Test mode also puts an ugly wallpaper watermark. This hassle is what creates reluctance to test. Even supposing things work well, I am not sure I would decide to mount drives via wnbd this way on a regular basis if it requires me to permanently turn off secure boot.
Thanks for feedback. We're currently shipping a signed version of the driver with the Ceph msi installer: https://cloudbase.it/ceph-for-windows/, I hope that helps. In the future, we may consider a standalone WNBD installer.
I think the ceph msi installer ceph_pacific_beta.msi, which is still beta, is also just test-signed. (Am I wrong?)
I had used the same before also.
certmgr:
b. Delete the cloudbase certificate under trusted-publishers/certificates.
.\wnbd-client.exe install-driver --filename 'C:\Program Files\Ceph\driver\wnbd.inf'
Key Usage: Digital Signature (80)
code 52:
Windows cannot verify the digital signature for the drivers required for this device. A recent hardware or software change might have installed a file that is signed incorrectly or damaged, or that might be malicious software from an unknown source. (Code 52)
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe -v
wnbd-client.exe: 0.2.2-4-g10c1fbe
libwnbd.dll: 0.2.2-4-g10c1fbe
libwnbd.dll!WnbdOpenAdapterEx ERROR No WNBD adapter found. Please make sure that the driver is installed.
driverver from wnbd.inf: DriverVer = 04/28/2022,6.15.27.572
Right, it's signed using a Cloudbase certificate and not by Microsoft, which is why secure boot will have to remain disabled. However, you may disable the "testsigning" setting (allowing self signed drivers) when using the Ceph msi.
We decided to keep the "beta" label until we implement CI testing and get more user feedback. The good news is that a Windows Ceph CI job has been added recently [1] and that we got quite a few users already, which provide valuable feedback (as you did already).
A stable driver release will also go through the Microsoft driver attestation and signing process. Unfortunately, I can't provide an ETA but I hope that this will be ready by the next Ceph release.
[1] https://jenkins.ceph.com/job/ceph-windows-pull-requests/6850/
I have been chasing the same issue for a few months now. 16.2.7 ceph and windows 2019. I see this across 4 separate clusters. The larger the cluster the worse it seems to be ( ranging from 50 -340 osds per cluster )
We had a firmware bug where OSDs restarted randomly, and this would trigger the Windows RBD volume to lock up and never return. A restart of the Windows server would sometimes cause ReFS to corrupt and go RAW. We have implemented snapshots to try to mitigate this.
A planned reboot of a ceph node doesn't seem to cause the issue. Only a OSD assert.
The fault is characterized by PendingSubmittedIORequests going up but not aborts.
We are open to paying for support to correct this issue.
@Jazz9 thanks for feedback. The issue may be slightly different as the OP wasn't using RBD but actually NBD (wnbd can also act as a standalone nbd client) and it affected other NBD clients as well.
There might be some problem around aborted or timed out requests, so I think it's worth reopening this issue. The bug may also reside on the Ceph side (rbd-wnbd leverages wnbd), which may not be able to gracefully handle lost OSD connections (at least on Windows). Just checking: have you encountered similar issues with Linux RBD clients?
As a side note, there is a recent Ceph fix that replaces select() with poll() on Windows in order to avoid fd limitations, which can be a problem when having a large amount of OSDs: https://github.com/ceph/ceph/pull/46525.
We'll definitely need to investigate this issue. Unfortunately, I'm no longer actively working on this project (at least for now), so I can't provide an ETA. One of my colleagues may be able to reproduce it.
If needed, you may get in touch with our support team by filling this form: https://cloudbase.it/about/#contact
We found the issue. If all the IO dispatch workers are waiting for RBD callbacks indefinitely, the userspace (libwnbd) may never get notified of mapping removals. We'll need to improve the way in which the driver informs the IO dispatcher that the disk has been removed (e.g. by using a separate control thread or some disconnect events).
We'll also have to avoid waiting for IO operations indefinitely on the rbd side (e.g. maybe double check our timeouts).
Great! Normally windows is really good at waiting for IO.
I would expect that the rbd side would need to timeout and retry the IO operation ( 20 sec timeout? ). That would assist if the OSD had restarted or the PG peered to another OSD.
Do you have a rough ETA on the fix?
There might be more to it since all rbd-wnbd IO callbacks are actually async and should return immediately (sometimes they don't and we see the hang). We even saw librbd::Image::close hang, the async version is likely going to hang as well.
It's a bit tricky to reproduce since on our env it only happens once per ~1000 rbd mappings. My colleague's been trying out pacific, quincy and the master ceph branch to see if it's a regression and collect some logs/process dumps.
While there are a few possible mitigations, we'll need to look into the hanging async calls which is probably the root cause of the issue. Unfortunately I can't provide an ETA, I'll try to assist the colleagues that are currently maintaining the project.
A small update: looks like we're affected by this winpthreads bug that causes shared mutexes to deadlock: https://github.com/msys2/MINGW-packages/issues/3319. Trace: https://pastebin.com/raw/i3jpTyS3
Ideally, we should get winpthreads fixed. The alternative would be to use a different shared mutex implementation. We did consider building Ceph using Clang or MSVC but that would require a large amount of code changes on the Ceph side.
Hello, mingw-w64 developer speaking. I have made some efforts to replace winpthreads due to its lack of maintainability. Do you have any interest in https://gcc-mcf.lhmouse.com/? The mcfgthread library doesn't implement RW locks, but libstdc++ has an alternative implementation using condition variables, so shared_mutex also works there.
hi @lhmouse, mcfgthread seems interesting, thanks a lot for the suggestion.
We're currently cross-compiling Ceph using Linux, so we'll have to build mcfgthread and also gcc with the custom patches on top. Would you recommend using a specific gcc version? I see that there are some wip patches for gcc 12: https://github.com/lhmouse/MINGW-packages/tree/master/mingw-w64-gcc
Thank you for your work on mingw, btw :).
hi @lhmouse, mcfgthread seems interesting, thanks a lot for the suggestion.
We're currently cross-compiling Ceph using Linux, so we'll have to build mcfgthread and also gcc with the custom patches on top. Would you recommend using a specific gcc version? I see that there are some wip patches for gcc 12: https://github.com/lhmouse/MINGW-packages/tree/master/mingw-w64-gcc
Only the 9000- and 9001- patches are necessary; the latter is required for building libgfortran, otherwise it can be dropped.
The mcfgthread library provides its own __cxa_atexit() and __cxa_finalize() implementations, which requires modification of the CRT. A complete set of patches (for both GCC and mingw-w64) is available at https://github.com/lhmouse/mcfgthread/tree/master/patches; please refer to the README.md file for details.
The patch for GCC should apply to GCC 11 and 12 branches cleanly. As far as I remember there haven't been any changes since GCC 9; probably it applies to 9 and 10 branches as well.
I built my personal toolchains using the standard cross-compilation procedure, which was necessary to replace the default CRT in MSYS2. It should also work when cross-compiling from Linux, I think.
Thank you for your work on mingw, btw :).
My pleasure.
It looks like you are close to a resolution on this issue. Do you have a ETA on the release?
It looks like you are close to a resolution on this issue. Do you have a ETA on the release?
the fix got approved, it's in the Ceph QA queue right now. I expect it to be merged in a couple of weeks. Afterwards, I'll backport it to Quincy and Pacific.
@petrutlucian94 - Did you have an update on the backport / availability of this fix?
@petrutlucian94 - Did you have an update on the backport / availability of this fix?
hi, the fix has been backported to pacific and quincy. there are a few more patches that we'd like to include before releasing a new MSI.
RC release of quincy 17.2.4 is due shortly 20220919 https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/DUWEJXQH4XRU5UBQOH7IZXVJVCAMOMUJ/ I hope the task of making a signed windows driver for the final-release doesn't slip though the cracks.
RC release of quincy 17.2.4 is due shortly 20220919 https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/DUWEJXQH4XRU5UBQOH7IZXVJVCAMOMUJ/ I hope the task of making a signed windows driver for the final-release doesn't slip though the cracks.
actually, we've just prepared Pacific and Quincy releases that include some of the fixes that didn't merge upstream yet. we've just finished testing it, so expect a new msi the following days.
for what is worth, the wnbd driver remains mostly the same, all the fixes that we had were actually targeting ceph, not the driver.
The Ceph MSIs have been updated: https://cloudbase.it/ceph-for-windows/
I think we can close this issue now. Just to clarify, we ended up using the same GitHub issue for two (or more) different problems:
I'm gonna test the latest beta ceph-quincy-17.2.4 with qemu-7.1 out of curiosity. Disabling secure boot may be a deterrent for me to a long-term transition to qemu<->nbd<->wnbd-nbd-driver<->diskmgmt.
Please name some of the device vendors (companies) and their corresponding devices (model numbers) that have an nbd driver. I suspect such a strategy by them could work, if the base case covers 99% of all user needs. Those users who want to live on the bleeding edge with the latest non-mainstream features will tend to be the type who know what they are doing, entertain risk and disable secure boot. In this case we are yet to determine whether this wnbd nbd driver covers the base case of a reliable nbd driver.
Btw, I think qemu can also provide an rbd device. Would I have better luck with qemu<->rbd<->wnbd-rbd-driver<->diskmgmt?
Please, name some of the device vendors (co.) and their corresponding devices (model no), that have a wnbd driver.
Can't remember if it was a chelsio or a cisco network card driver - the Microsoft signed driver was about 2 years behind the vendor signed driver, most probably because of the same reason. It definitely wasn't a wnbd driver, just their network card driver. My point was that other vendors are affected by this as well and prefer not to go through the Microsoft signing process.
Btw, I think qemu can also provide a rbd device, I think. would I have better luck with qemu<->rbd<->wnbd-rbd-driver<->diskmgmt
at that point you can just take qemu-nbd out of the picture and use rbd. it depends on your use case.
20220926: Changes that may be in effect between when the tests were last done and now
Haswell i7-4700QM
WinOS, qemu-VM, Test-image all on SSD, sgdata (synthetic generated data) on HDD
Windows-Home
winver: 21H2 (OS Build: 19044.2075)
Fedora 36 Workstation
Linux fedora 5.18.17-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Aug 11 14:36:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Ceph-Quincy-17.2.4 20220920
E:\vstorage\nbd>E:\vstorage\nbd\wnbd_client\wnbd-client.exe -v
wnbd-client.exe: 0.2.2-11-g3dbec5e
libwnbd.dll: 0.2.2-11-g3dbec5e
wnbd.sys: 0.2.2-11-g3dbec5e
E:\vstorage\nbd>qemu-nbd --version
qemu-nbd 7.1.0 (v7.1.0-11925-g4ec481870e-dirty)
Written by Anthony Liguori.
E:\vstorage\nbd>C:\vol\scoop_01\scoopg\apps\qemu\current\qemu-storage-daemon.exe --version
qemu-storage-daemon version 7.1.0 (v7.1.0-11925-g4ec481870e-dirty)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
raw
dd if=/dev/zero of=/mnt/a16/gkpics01.raw.img bs=1048576 count=95104 status=progress
qcow2-fixed
qemu-img create -f qcow2 H:\gkpics01.qcow2 -o preallocation=full 99723771904
qcow2-dyn
qemu-img create -f qcow2 /mnt/a16/gkpics01.qcow2 99723771904
vhdx-fixed
qemu-img create -f vhdx -o subformat=fixed,block_state_zero=off /mnt/a16/gkpics01.vhdx 99723771904
vhdx-dyn
qemu-img create -f vhdx -o subformat=dynamic /mnt/a16/gkpics01.vhdx 99723771904
raw
"C:\vol\prog_01\prog\windd_chrysocome\ddrelease64.exe" bs=1048576 count=95104 if=/dev/zero of=H:/gkpics01.raw.img --progress
qcow2-fixed ( Not Do-able as of now https://gitlab.com/qemu-project/qemu/-/issues/813 )
qemu-img.exe create -f qcow2 H:\gkpics01.qcow2 -o preallocation=full 99723771904
qcow2-dyn
qemu-img create -f qcow2 /mnt/a16/gkpics01.qcow2 99723771904
vhdx-fixed
New-VHD -Path I:\gkpics01.vhdx -SizeBytes 99723771904 -Fixed
or
qemu-img create -f vhdx -o subformat=fixed,block_state_zero=off F:/gkpics01.vhdx 99723771904
fsutil sparse setFlag F:\gkpics01.vhdx 0
vhdx-dyn
New-VHD -Path I:\gkpics01.vhdx -SizeBytes 99723771904 -Dynamic
or
qemu-img create -f vhdx -o subformat=dynamic F:/gkpics01.vhdx 99723771904
fsutil sparse setFlag F:\gkpics01.vhdx 0
[root@fedora mnt]# ( sdate=`date` ; echo "$sdate" ; dd if=/dev/zero of=/dev/vdc bs=$((512*204)) status=progress ; echo "$sdate" ; date )
Mon Sep 26 09:08:25 AM IST 2022
99677337600 bytes (100 GB, 93 GiB) copied, 1919 s, 51.9 MB/s
dd: error writing '/dev/vdc': No space left on device
954770+0 records in
954769+0 records out
99723771904 bytes (100 GB, 93 GiB) copied, 1920.97 s, 51.9 MB/s
Mon Sep 26 09:08:25 AM IST 2022
Mon Sep 26 09:40:26 AM IST 2022
[root@fedora gana]# mkfs -t ntfs -Q -L fs_gkpics01 /dev/vdc1
qemu-nbd as nbd-server
E:\vstorage\nbd>qemu-nbd.exe H:\gkpics01.raw.img -x gkpics01
WARNING: Image format was not specified for 'H:\gkpics01.raw.img' and probing guessed raw.
Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
Specify the 'raw' format explicitly to remove the restrictions.
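The warning can be avoided by passing the image format explicitly, for example (same image path and export name as above, using qemu-nbd's -f/--format option):
qemu-nbd.exe -f raw H:\gkpics01.raw.img -x gkpics01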
qemu-storage-daemon as nbd-server
direct raw
E:\vstorage\nbd> qemu-storage-daemon --blockdev driver=file,node-name=disk,filename=H:\gkpics01.raw.img --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=disk,name=gkpics01,writable=on
With a vdisk-format
qemu-storage-daemon --blockdev driver=file,node-name=file,filename=H:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on
qemu-system-x86_64 -cpu qemu64 -m 4096 -machine "type=q35,kernel-irqchip=off" -accel whpx -smp "sockets=1,cores=8,threads=1" -boot c -drive "index=0,if=virtio,media=disk,format=raw,file=H:\m01_bootnoefi.raw.img" -drive "index=1,if=virtio,media=disk,format=raw,file=H:\m01_lnx.raw.img" -drive "index=2,if=virtio,media=disk,format=qcow2,file=H:\gkpics01.qcow2" -drive "index=3,if=virtio,media=disk,format=vhdx,file=E:\test\sgdata.vhdx" -display gtk -vga virtio -rtc base=utc -netdev user,id=vmnic1,net=192.168.20.0/24,dns=192.168.20.3,dhcpstart=192.168.20.15,hostfwd=tcp::9551-:22 -device virtio-net,netdev=vmnic1 -chardev qemu-vdagent,id=ch1,name=vdagent,clipboard=on -device virtio-serial -device virtserialport,chardev=ch1,id=ch1,name=com.redhat.spice.0 -qmp "tcp:127.0.0.1:5955,server,nowait"
PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01
[root@fedora gana]# mount -t ntfs3 /dev/vdd2 /mnt/b
[root@fedora gana]# mount -t ntfs3 /dev/vdc1 /mnt/a
[root@fedora gana]# umount /mnt/a
[root@fedora gana]# umount /mnt/b
The ntfs3 fs-type argument is important, as Fedora 36 Linux by default uses the fuseblk driver:
[root@fedora gana]# cat /etc/mtab | grep vdc1
without -t ntfs3 fstype-argument
/dev/vdc1 /mnt/a fuseblk rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0
vs with -t ntfs3 fstype argument
/dev/vdc1 /mnt/a ntfs3 rw,relatime,uid=0,gid=0,iocharset=utf8 0 0
Get-Volume -FriendlyName fs_gkpics01 | Get-Partition | Add-PartitionAccessPath -accesspath "K:"
Get-Volume -Drive "K" | Get-Partition | Remove-PartitionAccessPath -accesspath "K:"
Set-Disk -Number $diskNumber -IsOffline $true
[root@fedora gana]# ( fl="photos001" ; src="/mnt/b" ; dst="/mnt/a" ; sdate=`date` ; cd "$src" ; rsync -avH "$fl" "$dst" ; echo "$sdate" ; date ; sudo -u gana DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus DISPLAY=:0.0 -- notify-send "$src/$fl" "rsync $src/$fl" )
Use the Windows Explorer GUI in order to copy-paste data from the src folder to the dst folder.
Using xcopy:
ps> pwsh -Command { $sdate=$(date) ; echo "$sdate" ; xcopy /E E:\test\sgdata K:\ ; echo "$sdate" ; echo "$(date)" }
:
15377 File(s) copied
09/25/2022 15:24:13
09/25/2022 15:45:21
PS C:\> sync K:
Sync v2.2 - Flush cached data to disk.
Copyright (C) 2016 Mark Russinovich
Sysinternals - www.sysinternals.com
Flushing: K
[root@fedora gana]# sync
PS E:\vstorage\nbd\wnbd_client> .\wnbd-client.exe unmap gkpics01
libwnbd.dll!WnbdRemoveEx WARNING Soft device removal failed. Falling back to hard removal.
This happens even if I do drive-letter removal in diskmgmt. Unmapping the drive takes about 30 seconds before anything happens. Note the error message.
[root@fedora gana]# # verification under linux-VM
[root@fedora gana]# mount /dev/vdc1 /mnt/a
[root@fedora gana]# ( fl="photos001" ; rtpt="/mnt/a" ; cd "${rtpt}/${fl}" ; sdate=`date` ; echo "$sdate" ; sha256sum -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; date ; sudo -u gana DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus DISPLAY=:0.0 -- notify-send "$src/$fl" "checksum $src/$fl" )
Sun Sep 25 11:51:34 AM IST 2022
Sun Sep 25 11:51:34 AM IST 2022
Sun Sep 25 12:13:53 PM IST 2022
PS C:\Users\gana> powershell -Command { $fl="photos001" ; $rtpt="K:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)" }
C:\Users\gana\OneDrive\Documents\WindowsPowerShell
09/25/2022 15:55:43
09/25/2022 15:55:43
09/25/2022 16:07:55
The error message during unmap/disconnect/detach of the drive in Windows happens even if I manually remove the drive letter in the diskmgmt GUI before unmapping.
Q1) How can one unmount the drive without worrying about causing an error?
Tests [01,02,03,04] were done using a fixed-size raw image. I will do more tests and try to see what happens with expanding virtual disks, but as you say, that puts wnbd partly off the hook. However, the wnbd driver should still handle it without a lockup disaster even if the nbd server is slow to catch up.
Just wondering: have you attached the same image to a windows host and linux vm simultaneously? That's likely going to corrupt the ntfs filesystem. In general, only a cluster aware filesystem can be safely attached to multiple hosts without risking data corruption (e.g. Windows Cluster Shared Volume, but that requires persistent SCSI reservations, which we don't support yet).
Q1) How can one unmount the drive without worrying about causing an error?
Wnbd tries to do a "soft" unmap first, which notifies the Windows storage stack that the disk is going to be removed, letting other drivers flush their data before proceeding with the unmap. By default, we'll wait 30s before doing a hard remove as a fallback. If --debug is specified, wnbd-client.exe will show which driver is preventing the disk from being unmapped.
Here are the parameters that unmap can receive:
Usage: wnbd-client unmap <instance-name>
[--hard-disconnect] [--no-hard-disconnect-fallback]
[--soft-disconnect-timeout <arg>]
[--soft-disconnect-retry-interval <arg>]
Remove disk mapping.
Positional arguments:
<instance-name> Disk identifier.
Optional arguments:
[--hard-disconnect] Perform a hard disconnect. Warning:
pending IO operations as well as
unflushed data will be discarded.
[--no-hard-disconnect-fallback] Immediately return an error if the
soft disconnect fails instead of
attempting a hard disconnect as
fallback.
[--soft-disconnect-timeout <arg>] Soft disconnect timeout in seconds.
The soft disconnect operation uses
PnP to notify the Windows storage
stack that the device is going to be
disconnected. Storage drivers can
block this operation if there are
pending operations, unflushed caches
or open handles. Default: 30.
[--soft-disconnect-retry-interval <arg>] Soft disconnect retry interval in
seconds. Default: 2.
Common arguments:
[--debug] Enable debug logging.
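For example (a hypothetical invocation built only from the parameters listed above), to fail with an error instead of forcing a hard disconnect and to log which driver is blocking the removal:
wnbd-client unmap gkpics01 --no-hard-disconnect-fallback --debug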
Nope, I make certain that the raw image isn't simultaneously used by the nbd client and the VM. By default, qemu-nbd is set to disconnect when a client disconnects. (There is another option to let it remain online for a re-connection, -t, --persistent "don't exit on the last connection", but I haven't used that.)
After shutting down the VM (systemctl poweroff), I make sure qemu-system-x86_64 has given back control to the cmd prompt before I start qemu-nbd in preparation for the Windows attachment. Yes, I know it would be very bad if two non-cooperating processes used and wrote to the image simultaneously; that is not my intent.
There could be a bug there if the soft disconnect is not happening. I don't need to do any data transfer for that to happen; just a map followed by an unmap is enough (see the dummy map/unmap cycle below).
In my tests there is no data error on account of the hard disconnect. I make sure to use the sync command before unmapping. But, as I said, it is still a little worrying.
As an end goal, I envision having a 2 TB disk with many 100 GB vdisks on it, to be attached on a need basis.
Ideally, I want to do away with different formats and go with the ONE TRUE VDISK FORMAT that works well with Windows or Linux, be it Hyper-V or kvm/qemu, be it attached to the host OS or a guest VM OS.
I hope to make my system as follows.
In the comment before last, you mentioned taking qemu-nbd out of the picture. Did you have another open-source nbd server in mind for me to try on the Windows host? I thought the ceph server needs to run on a Linux host. Also, the whole cluster FS is too much overkill for a single-user, single-machine system.
There could be a bug there if soft-disconnect is not happening.
Well, if there's another driver from the storage stack that's preventing us from doing a soft disconnect, there's not much we can do. We can either return an error or do a force unmap, which can be configured using the unmap parameters. Again, it might be useful to know who's holding the disk (using the --debug parameter).
In the comment before last, you mentioned to get rid of qemu. Did you have another open-source server in mind for me to try on the windows host. I thought ceph-server needs to run on a linux-host. Also, the whole cluster-FS is too much overkill for a single user single machine system.
indeed, ceph was definitely not intended for single host environments. if qemu-nbd causes problems, you may also consider lvm + iscsi or even zfs + iscsi.
A dummy map and unmap cycle:
As I had removed the drive letter in diskmgmt in a previous attempt, when the wnbd map command attaches the drive, diskmgmt remembers not to assign a drive letter. So on this trial run there is no need to even do a sync or remove the drive letter. Nevertheless, the unmap soft removal fails. All commands are executed under a minute.
PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01 --debug
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}
PS C:\> wnbd-client.exe list
Pid DiskNumber Nbd Owner InstanceName
2936 10 true wnbd-client gkpics01
PS C:\> wnbd-client.exe stats --instance-name gkpics01
Disk stats
TotalReceivedIORequests : 104
TotalSubmittedIORequests : 104
TotalReceivedIOReplies : 104
UnsubmittedIORequests : 0
PendingSubmittedIORequests : 0
AbortedSubmittedIORequests : 0
AbortedUnsubmittedIORequests : 0
CompletedAbortedIORequests : 0
OutstandingIOCount : 0
PS E:\vstorage\nbd\wnbd_client> wnbd-client.exe unmap --instance-name gkpics01 --debug
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}
libwnbd.dll!WnbdOpenAdapterEx DEBUG Opening WNBD adapter device: \\?\root#scsiadapter#0000#{949dd17c-b06c-4c14-8e13-f1a3a4a6dbcb}
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 5.82s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 13.63s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 21.45s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 29.31s
libwnbd.dll!PnpRemoveDevice DEBUG Could not remove device. CM status: 23, veto type 5, veto name: SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000500. Time elapsed: 37.09s
libwnbd.dll!WnbdRemoveEx WARNING Soft device removal failed. Falling back to hard removal.
That's useful information, the veto type is set to 5 (PNP_VetoOutstandingOpen) [1], so there's probably an open handle.
I usually use Process Hacker or the sysinternals tools to identify open handles. You could also try setting the disk offline before unmapping it:
Set-Disk -Number $diskNumber -IsOffline $true
[1] https://learn.microsoft.com/en-us/windows/win32/api/cfg/ne-cfg-pnp_veto_type#constants
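As a possible way to grab the disk number for Set-Disk (an assumption on my part: the WNBD disk's FriendlyName contains "WNBD", matching the Ven_WNBD&Prod_WNBD_DISK inquiry data in the veto name above; wnbd-client.exe list also prints the DiskNumber directly):
$diskNumber = (Get-Disk | Where-Object FriendlyName -Like '*WNBD*').Number
Set-Disk -Number $diskNumber -IsOffline $true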
I determined what the cause is.
I usually keep an admin-privileged taskmgr.exe process running for emergency troubleshooting.
If taskmgr is closed, the soft removal succeeds.
Why does taskmgr do that? It probably shouldn't.
The Event Viewer (eventvwr.msc) had the following info:
map:
warn 4:37:15 NbdNegotiate:296 Ignoring unsupported NBD reply info type: 3
Info 4:37:15 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy. No action is needed.
unmap:
Warn 4:37:36 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:39 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy. No action is needed.
Warn 4:37:45 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:45 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy. No action is needed.
Warn 4:37:53 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:37:55 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy. No action is needed.
Warn 4:38:01 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Info 4:38:01 Volume \\?\Volume{27ec1801-27fc-4877-b1dc-0b3a137bceb5} (\Device\HarddiskVolume76) is healthy. No action is needed.
Warn 4:38:08 The application \Device\HarddiskVolume4\Windows\System32\Taskmgr.exe with process id 36964 stopped the removal or ejection for the device SCSI\Disk&Ven_WNBD&Prod_WNBD_DISK\1&2afd7d61&2&000700.
Warn 4:38:08 NbdReadExact:36 Failed with : 0
Err 4:38:38 NbdReadReply:490 Could not read command reply.
Warn 4:38:38 Disk 10 has been surprise removed.
The same thing happens even if the volume/disk is set offline using the Set-Disk command; diskmgmt shows the disk having gone offline.
Apparently, that's an M$ thing.
If taskmgr is closed, the soft removal succeeds. Why does taskmgr do that? It probably shouldn't
taskmgr provides various disk IO statistics, so I assume it holds some disk handles. That's normal behavior, nice catch though.
Warn 4:38:08 NbdReadExact:36 Failed with : 0 Err 4:38:38 NbdReadReply:490 Could not read command reply.
That message gets logged when an nbd connection is closed. I've added a TODO item to avoid logging an error since that can be confusing.
20230414 wnbd/qsd: Documenting a stuck on xcopy situation for reproduction
PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe version
wnbd-client.exe: 0.4.1-10-g5c5239c
libwnbd.dll: 0.4.1-10-g5c5239c
wnbd.sys: 0.4.1-10-g5c5239c
Copy test
qemu-w64-setup-20230411.exe
https://qemu.weilnetz.de/w64/.
D:\tmpq\qemu\8.0.0-rc3\qemu\qemu-storage-daemon.exe --blockdev driver=file,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on
F:\gkpics01.qcow2 is qcow2-fixed, 93 GiB (prepare this qcow2-fixed vdisk file ahead of time on a Linux machine; the Windows qemu-img as of today cannot create qcow2-fixed).
D:\vstorage\nbd\wnbd_client\wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01
The mapped volume carries the label fs_gkpics01 and is mounted as J:.
Perform the data copy test with the xcopy command: pwsh -Command { $sdate=$(date) ; echo "$sdate" ; xcopy /E D:\test\sgdata\photos001 J:\photos001 ; echo "$sdate" ; echo "$(date)" }
The OS becomes stuck about 5 minutes into the roughly 15 minute operation. In the sysinternals procexp app, the disk-usage graph drops from MBps to Bps.
:
D:\test\sgdata\photos001\D0014_127c4d62\File_0000.bin
... OS/xcopy is stuck here, below is after unstucking
File creation error - The system cannot find the path specified.
04/14/2023 17:07:12
04/14/2023 17:12:59
PS C:\Windows\system32> dir D:\test\sgdata\photos001\D0014_127c4d62\
Directory: D:\test\sgdata\photos001\D0014_127c4d62
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 11/22/2021 3:13 PM 4294694852 File_0000.bin
To unstick, run wnbd-client uninstall-driver.
The qemu-storage-daemon, which was silent till now, is disrupted by the force-removed disk and prints (about 20 times):
D:\tmpq\qemu\8.0.0-rc3\qemu\qemu-storage-daemon.exe: Disconnect client, due to: Failed to send reply: Unable to write to socket: Input/output error
D:\tmpq\qemu\8.0.0-rc3\qemu\qemu-storage-daemon.exe: Disconnect client, due to: Failed to send reply: Unable to write to socket: Input/output error
:
The qemu-storage-daemon, however, is still in running condition, and will remain so unless one presses Ctrl-C.
The xcopy will say "File creation error - The system cannot find the path specified."
As the qemu-storage-daemon is still running, it is possible to reinstall the wnbd driver, map the volume, format the volume and restart the xcopy from the beginning. This sort of indicates that the qemu-storage-daemon nbd-server is responsive.
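For reference, the recovery sequence in that situation is roughly the following (a sketch assembled from the commands shown earlier in this thread; the wnbd.inf path is an assumption mirroring the wnbd_driver directory used before):
wnbd-client.exe uninstall-driver
wnbd-client.exe install-driver D:\vstorage\nbd\wnbd_driver\wnbd.inf
wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01
followed by re-formatting the mapped volume and restarting the xcopy.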
In the below, who is \Device\RaidPort4 that raises event ID 129?
I don't even use RAID.
The laptop has 3 disks: a SATA SSD, a SATA HDD, and a SATA HDD via an mSATA cdrom caddy.
I later gathered that this RaidPort number is whatever wnbd attaches itself as on the underside to storport.
Looking into the event log filtered for event ID 129, the source is usually wnbd.
Maybe the number of times it shows up corresponds to the number of times I've let the system get stuck by this test.
In the past, wnbd has raised event ID 129 as RaidPort1, RaidPort6 and RaidPort10, so it depends on what it gets assigned during driver insertion.
The binary data payload seems to be the same for each event ID 129.
The log below is from a different run.
D:\test\sgdata\photos001\D0020_bf23a7e9\File_0452.bin
File creation error - A device which does not exist was specified.
04/19/2023 13:57:35
04/19/2023 14:07:43
I looked at eventvwr.msc, Windows Logs / System category. This is the first event after an event-lull period, about 9 minutes into the xcopy step.
Log Name: System
Source: wnbd
Date: 4/19/2023 2:06:34 PM
Event ID: 129
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: Capella-PC
Description:
The description for Event ID 129 from source wnbd cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
\Device\RaidPort4
The message resource is present but the message was not found in the message table
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="wnbd" />
<EventID Qualifiers="32772">129</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2023-04-19T08:36:34.8522246Z" />
<EventRecordID>277544</EventRecordID>
<Correlation />
<Execution ProcessID="4" ThreadID="23460" />
<Channel>System</Channel>
<Computer>Capella-PC</Computer>
<Security />
</System>
<EventData>
<Data>\Device\RaidPort4</Data>
<Binary>0F001800010000000000000081000480010000000000000000000000000000000000000000000000000000000000000000000000810004800000000000000000</Binary>
</EventData>
</Event>
After this, there are many disk IO errors like the one below. There are 4 of these at 2:06:34:
Log Name: System
Source: disk
Date: 4/19/2023 2:06:34 PM
Event ID: 153
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: Capella-PC
Description:
The IO operation at logical block address 0x5f3038 for Disk 11 (PDO name: \Device\000000c0) was retried.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="disk" />
<EventID Qualifiers="32772">153</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2023-04-19T08:36:34.8522246Z" />
<EventRecordID>277545</EventRecordID>
<Correlation />
<Execution ProcessID="4" ThreadID="23460" />
<Channel>System</Channel>
<Computer>Capella-PC</Computer>
<Security />
</System>
<EventData>
<Data>\Device\Harddisk11\DR34</Data>
<Data>0x5f3038</Data>
<Data>11</Data>
<Data>\Device\000000c0</Data>
<Binary>0F01040004002C0000000000990004800000000000000000000000000000000000000000000000000000022A</Binary>
</EventData>
</Event>
This is after the adapter reset:
Log Name: System
Source: disk
Date: 4/19/2023 2:07:39 PM
Event ID: 157
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: Capella-PC
Description:
Disk 11 has been surprise removed.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="disk" />
<EventID Qualifiers="32772">157</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2023-04-19T08:37:39.6266117Z" />
<EventRecordID>277550</EventRecordID>
<Correlation />
<Execution ProcessID="4" ThreadID="24996" />
<Channel>System</Channel>
<Computer>Capella-PC</Computer>
<Security />
</System>
<EventData>
<Data>\Device\Harddisk11\DR34</Data>
<Data>11</Data>
<Binary>0000000002003000000000009D000480000000000000000000000000000000000000000000000000</Binary>
</EventData>
</Event>
Events that come after the previous one: there are many ntfs-delayed-write-failed and application-popup-delayed-write-failed events, which are to be expected after disk removal.
Q) Does the above eventvwr wnbd event ID 129 and its binary payload provide any debugging clue?
As in the case of the Azure link below, the wnbd interface to the network block device is an emulated device, and there is no real physical SCSI hardware for a failure to happen on. Perhaps if some nbd communication times out, it assumes a 129. Perhaps there is some filesystem maintenance for file expansion or something that kicks in to cause the delay. Note: mine is a Haswell quad-core machine, the nbd-server and storport are colocated, and it might not be as fast as the machines you use.
Apparently, HyperV also had similar resets, which needed to be handled, as they were harmless emulated timing issues. https://support.microsoft.com/en-us/topic/many-id-129-events-are-logged-when-you-run-a-windows-server-2012-r2-virtual-machine-in-windows-azure-a9c9d4eb-7533-2704-6049-7428bbb565cb
qemu-storage-daemon does not do any disk virtualization / physical-disk emulation / bus-protocol emulation; its file/virtual-disk/block code processes data back and forth directly to the nbd export. Unlike qemu-nbd, which by default halts on client connection drop and requires a persistent command-line argument to tell it not to terminate, the qemu-storage-daemon process runs (persistent by default) until one presses Ctrl-C on it. So it should be possible to reconnect.
It just means that the nbd connection hung and storport is issuing LUN resets.
I'll need to:
I've replicated the issue using qemu-storage-daemon.exe. The NBD connection eventually hangs and we can no longer send requests or receive replies. Windows becomes extremely unresponsive when that happens.
This doesn't affect rbd-wnbd, so the potential causes are:
Thanks again for the bug report, I'm reopening the issue. Sorry for the slow progress, our main focus is currently Ceph, we're providing the NBD client functionality as a best effort.
LE: NBD socket reads and writes hang indefinitely: https://pastebin.com/raw/6rrggyDs. We'll move the NBD client to userspace as soon as possible.
LLE: there seems to be a problem with qemu-storage-daemon on Windows. Wnbd works fine with Linux qemu-storage-daemon NBD servers or the standard Linux nbd-server.
I did prepare a draft that moves the NBD implementation to userspace, making it easier to maintain: https://github.com/cloudbase/wnbd/pull/122. I'll probably prepare another PR that forwards LUN resets to the wnbd daemon so that it may reset or remove the connection instead of blocking the host.
Looks like qemu's blk_co_pwrite hangs at some point: https://github.com/qemu/qemu/blob/6972ef1440a9d685482d78672620a7482f2bd09a/nbd/server.c#L2514. This might also affect Qemu VMs on Windows.
@hgkamath I spent some time looking at qemu-storage-daemon; some WriteFile calls were hanging. The good news is that it seems to work fine when passing cache.direct=on, for example:
qemu-storage-daemon.exe --blockdev driver=file,node-name=file,filename=e:\img\test_100gb.raw,cache.direct=on --blockdev driver=raw,node-name=raw,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=raw,name=test_100gb,writable=on
I'll try that out.
One thought that I wanted to pass along: I discovered qemu-win also uses mingw-winpthreads by default.
There is a bug, which I filed myself (https://gitlab.com/qemu-project/qemu/-/issues/1605), in which I found that the workaround is to use native threads with aio=native instead of the default mingw-winpthreads, the implicit option being aio=threads.
I have encountered that particular qemu bug only with rapid expansion of a VHDX-dynamic vdisk-file (vhdx-dyn). The chance that it surfaces here is slim because this test uses qcow2-fixed: a) for starters, qemu's implementation of its native qcow2 format is much more performant than VHDX; b) being a fixed vdisk, the block allocation is complete, so this bug's vdisk file does not have to expand.
Also, your comment confirms it happens for the raw format itself. Indeed, one should test with raw. This bug is independent of qemu's vdisk format, and it is best to choose raw to remove uncertainties arising from qemu's vdisk implementations.
Nevertheless, I now test with the drive option aio=native just to reduce the chance of that.
PS C:\Users\admin> qemu-storage-daemon.exe --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on
PS C:\Windows\system32> $(dir F:\gkpics01.qcow2).Length
99739238400
PS C:\Windows\system32> fsutil sparse queryflag F:\gkpics01.qcow2
This file is NOT set as sparse
As expected, I still experienced a hang with aio=native as well.
In one of the hangs, it was possible to do an uninstall-driver, install-driver, map and resume/restart the xcopy without having to Ctrl-C the qemu-storage-daemon, meaning that in that run the qemu-storage-daemon nbd-server was not stuck and nbd reconnection succeeded.
From the qemu manual pdf pg41 :
cache.direct : The host page cache can be avoided with cache.direct=on. This will attempt to do disk IO directly to the guest’s memory. QEMU may still perform an internal copy of the data.
For cache.direct to work on Windows, the file-open flag O_DIRECT should be supported by the windows/mingw/win32 API, and the filesystem and filesystem driver should also support it.
Win32 has FILE_FLAG_NO_BUFFERING for CreateFile and ReOpenFile (see the file-buffering docs); FILE_FLAG_NO_BUFFERING maps to the O_DIRECT flag of the Linux open call.
So, it is supported: block/file-win32.c does use the flag.
In file-win32.c/file-posix.c, the function raw_parse_flags() sets the corresponding flags based on whether BDRV_O_NOCACHE is set, which is set when the cache.direct option is provided.
... just checking.
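For illustration only, here is a rough sketch of that mapping (my own paraphrase of what raw_parse_flags() amounts to on Windows, not the actual qemu code; MY_O_NOCACHE and MY_O_NATIVE_AIO are placeholders for BDRV_O_NOCACHE / aio=native):

#include <windows.h>

#define MY_O_NOCACHE    (1 << 0)  /* placeholder for BDRV_O_NOCACHE (cache.direct=on) */
#define MY_O_NATIVE_AIO (1 << 1)  /* placeholder for aio=native */

static HANDLE open_image(const char *path, int flags)
{
    DWORD attrs = FILE_ATTRIBUTE_NORMAL;
    if (flags & MY_O_NATIVE_AIO) {
        attrs |= FILE_FLAG_OVERLAPPED;    /* asynchronous (overlapped) I/O */
    }
    if (flags & MY_O_NOCACHE) {
        attrs |= FILE_FLAG_NO_BUFFERING;  /* bypass the host page cache;
                                             the Win32 analogue of O_DIRECT */
    }
    return CreateFileA(path, GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ,
                       NULL, OPEN_EXISTING, attrs, NULL);
}

One practical consequence of FILE_FLAG_NO_BUFFERING is that reads and writes must be sector-aligned, which the block driver has to take care of.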
It's a good idea to quickly detect the stuck situation and react by uninstalling the driver.
Superstition: I think leaving the machine unattended and letting it remain in this stuck-state can make the machine slower and eventually lead to a more serious system crash.
Techniques I use to quickly detect a stuck situation:
PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1}" -f $(date),$(dir F:\gkpics01.qcow2).Length ; sleep 6 ; $i = $i + 1 }
PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1} {2}" -f $(date),$(dir F:\gkpics01.qcow2).Length,$(dir F:\gkpics01.qcow2).LastWriteTime[0].ToString() ; "test" | Out-File -FilePath K:\tmpq\Downloads\a.txt ; sleep 6 ; $i = $i + 1 }
But with this, I could not reproduce the stuck-state; it was acting like a keep-alive. Another superstition that I am still watching out for is that, if the stuck happens, it is most likely at the really large files. It's not always the same file, but I will need to see whether it is always one of the large ones below.
EDIT: This is maybe just that, a superstition. I started noting the file-size of the file the copy ended on, and neither it nor the next one was always large; sometimes the files were as small as 3 MiB.
>512 MiB
D:\test\sgdata>forfiles /S /M * /C "cmd /c if @fsize GEQ 536870912 echo @path @fsize"
"D:\test\sgdata\photos001\D0011_325a7df4\D0004_ff4697dc\File_0000.bin" 870548550
"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0000.bin" 917280712
"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0001.bin" 1193905888
"D:\test\sgdata\photos001\D0014_127c4d62\File_0000.bin" 4294694852
"D:\test\sgdata\photos001\D0015_a42d4363\File_0000.bin" 2381007674
"D:\test\sgdata\photos001\D0015_a42d4363\File_0001.bin" 2103925866
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0012.bin" 1223441112
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0014.bin" 1498293582
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0019.bin" 800657438
"D:\test\sgdata\photos001\D0017_6986e32f\File_0000.bin" 1889579172
"D:\test\sgdata\photos001\D0017_6986e32f\File_0001.bin" 1978039382
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1824.bin" 885577878
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1849.bin" 758393712
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1861.bin" 622607906
"D:\test\sgdata\photos001\D0020_bf23a7e9\File_1867.bin" 559405836
"D:\test\sgdata\photos001\D0021_1ab60b0c\File_1027.bin" 790637902
"D:\test\sgdata\photos001\D0021_1ab60b0c\File_1030.bin" 859589186
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0030.bin" 586790162
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0032.bin" 710406956
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0034.bin" 632330336
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0036.bin" 662133078
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0040.bin" 1157315322
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0042.bin" 1553586422
"D:\test\sgdata\photos001\D0025_7bd77bae\File_0110.bin" 785962104
"D:\test\sgdata\photos001\D0025_7bd77bae\File_0112.bin" 689221632
"D:\test\sgdata\photos001\D0026_409200f8\File_0034.bin" 797661226
"D:\test\sgdata\photos001\D0026_409200f8\File_0036.bin" 1151132800
"D:\test\sgdata\photos001\D0030_2a29839c\File_0516.bin" 835986114
"D:\test\sgdata\photos001\D0031_b20f40a7\File_0045.bin" 828452346
>1 GiB
D:\test\sgdata>forfiles /S /M * /C "cmd /c if @fsize GEQ 1073741824 echo @path @fsize"
"D:\test\sgdata\photos001\D0011_325a7df4\D0007_cfa36bee\File_0001.bin" 1193905888
"D:\test\sgdata\photos001\D0014_127c4d62\File_0000.bin" 4294694852
"D:\test\sgdata\photos001\D0015_a42d4363\File_0000.bin" 2381007674
"D:\test\sgdata\photos001\D0015_a42d4363\File_0001.bin" 2103925866
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0012.bin" 1223441112
"D:\test\sgdata\photos001\D0016_bcd34db1\File_0014.bin" 1498293582
"D:\test\sgdata\photos001\D0017_6986e32f\File_0000.bin" 1889579172
"D:\test\sgdata\photos001\D0017_6986e32f\File_0001.bin" 1978039382
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0040.bin" 1157315322
"D:\test\sgdata\photos001\D0024_fded2015\D0000_2b73ea53\File_0042.bin" 1553586422
"D:\test\sgdata\photos001\D0026_409200f8\File_0036.bin" 1151132800
Confirming that the workaround cache.direct=on gave me a successful run. Will do future tests both with and without it.
I theorize that removing the OS/host caching from the picture probably improves the responsiveness (the time to respond to the wnbd client), and hence makes it less likely for the protocol to trip and lead into the stuck-state.
Probably the Windows OS puts processes into a sleep state while data is going into and out of the cache.
The question becomes: is it the Windows OS tripping on caching, qemu tripping on caching, or the wnbd client tripping due to the delay?
PS C:\Windows\system32> qemu-storage-daemon.exe --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2,cache.direct=on --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on
PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)" }
:
15377 File(s) copied
05/18/2023 17:33:06
05/18/2023 17:50:19
# unmap remap
PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe unmap gkpics01
libwnbd.dll!PnpRemoveDevice WARNING Could not remove device. CM status: 23, veto type 6, veto name: STORAGE\Volume\{50a6c206-f464-11ed-9807-3464a9d34f26}#0000000001000000. Time elapsed: 1.87s, time left: 28s. Check the Windows System event log (Kernel-PnP source) for more information about the process that is holding the disk.
PS C:\Windows\system32> D:\vstorage\nbd\wnbd_client\wnbd-client.exe map --hostname localhost --port 10809 --instance-name gkpics01
PS C:\Windows\system32>
PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)" }
05/18/2023 17:50:49
05/18/2023 17:50:49
05/18/2023 18:06:20
Thanks for the confirmation. It doesn't seem related to NBD or the wnbd client. Qemu's WriteFile calls [1] are hanging unless we disable caching. I didn't dig too much into qemu's caching related code, so I'm not sure if, apart from FILE_FLAG_NO_BUFFERING [2], there are any other changes when passing cache.direct=on.
I wouldn't completely exclude some Windows overlapped IO weirdness, we had some similar issues [3].
[1] https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L86 [2] https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L281 [3] https://github.com/cloudbase/wnbd/pull/30
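For readers unfamiliar with Win32 overlapped I/O, here is a minimal sketch of the asynchronous write-and-wait pattern involved (my own illustration, not qemu's or wnbd's actual code); a hang like the one observed would leave the caller stuck in the completion wait:

#include <windows.h>

/* Issue an asynchronous WriteFile at a given offset and block until it
 * completes. The handle must have been opened with FILE_FLAG_OVERLAPPED. */
static BOOL write_at(HANDLE h, const void *buf, DWORD len, ULONGLONG offset)
{
    OVERLAPPED ov = {0};
    DWORD done = 0;
    BOOL ok;

    ov.Offset     = (DWORD)(offset & 0xFFFFFFFFu);
    ov.OffsetHigh = (DWORD)(offset >> 32);
    ov.hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);

    if (!WriteFile(h, buf, len, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        CloseHandle(ov.hEvent);
        return FALSE;  /* the request failed immediately */
    }
    /* Wait for completion; a stuck request keeps us blocked right here. */
    ok = GetOverlappedResult(h, &ov, &done, TRUE);
    CloseHandle(ov.hEvent);
    return ok && done == len;
}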
It feels like we're getting somewhere. Many thanks for your effort.
Most of what I write here is probably just verbiage added to what you wrote succinctly.
(Also note that I am a user, not a developer, so the terms I use might not be correct.)
In the qemu project, on the Windows side of things, there are only maintainers and no developers. Windows bugs get fixed when it is a happy side-effect of something else they were doing, usually on Linux, or when Windows users who have done the troubleshooting nail something down and point out what exactly has to be done. There aren't that many advanced-level Windows users.
Just to remember: when drive-lockup/system-stuck happens, only open windows/applications respond, until they are forced to do some disk access. If an application attempts to do something with the disk, it freezes and becomes unresponsive (e.g. notepad, loading a command in powershell, etc.).
Either one of the following is true:
1) the WriteFile() is just yet another symptom of the drive-lockup/system-stuck (the chain of events leading to drive-lockup being caused elsewhere), which would be a possibility if, as you said, the cause could be pinned to something cache.direct=on did elsewhere, apart from setting FILE_FLAG_NO_BUFFERING, wherever else BDRV_O_NOCACHE has influence ... and the overlapped-IO weirdness
2) or the WriteFile(), in conjunction with other asynchronous read/write calls, due to some combination of faulty flags and structure parameters, is causing the block/deadlock, and everything is stuck because of it
3) or the WriteFile() is actively part of a cyclic chain of components that are in deadlock (cyclic-deadlock)
If it is a cyclic deadlock:
WriteFile() is stuck because win10 is in drive-lockup.
cache.direct=on is the disabling of host-pagefile-caching/file-buffering.
When J: is force-unmounted, probably a Write() call returns in error and the xcopy process bails out/gets terminated; in qemu-storage-daemon, the WriteFile() call returns, and so it goes back to being responsive.
Maybe cache.direct=on is necessary and one way to avoid the cyclic-deadlock. It obviously works by cutting the win10 OS cache out of the picture, removing one extra copy-in/copy-out step. But why does it work? Could it be that under more rigorous OS workloads, deadlock may happen even without file-buffering?
Or maybe cache.direct=on is unnecessary: the file-buffering that matters is at the top layer (J:), and file-buffering in the in-between layer stage (F: and the qemu-nbd block-device emulation) is redundant.
But even if host-pagefile-caching/file-buffering, though unnecessary, is allowed to be enabled, one should not run into drive-lockup/system-stuck; there must be another way to prevent the cyclic-deadlock.
xcopy has a /J option (unbuffered I/O). I'll do some test-runs with that and the default cache.direct=off.
PS C:\Windows\system32> qemu-storage-daemon.exe --blockdev driver=file,aio=native,node-name=file,filename=F:\gkpics01.qcow2 --blockdev driver=qcow2,node-name=qcow2,file=file --nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 --export type=nbd,id=export,node-name=qcow2,name=gkpics01,writable=on
PS C:\Windows\system32> xcopy /?
:
/J Copies using unbuffered I/O. Recommended for very large files.
:
PS C:\Windows\system32> $sdate=$(date) ; echo "$sdate" ; xcopy /J /E D:\test\sgdata\photos001 J:\photos001 ; echo "$sdate" ; $edate=$(date) ; echo "$edate"
15377 File(s) copied
05/19/2023 15:00:32
05/19/2023 15:18:43
PS C:\Windows\system32> powershell -Command { $fl="photos001" ; $rtpt="J:" ; cd "${rtpt}/${fl}" ; $sdate=$(date) ; echo "$sdate" ; C:\vol\scoop_01\scoopg\apps\git\current\usr\bin\sha256sum.exe -c "$rtpt/$fl/find.CHECKSUM" --quiet ; echo "$sdate" ; echo "$(date)" }
05/19/2023 15:23:59
05/19/2023 15:23:59
05/19/2023 15:39:13
Unbuffered write
It's also possible that, if the nbd-client part of wnbd were jettisoned into userspace, the OS/kernel would not be insistent that wnbd respond immediately or within a short time bound, allowing for a slower or delayed response from the emulated drive.
Q) Another question, to learn from you: requesting you to document what you did and what tools you used. As seen in bug qemu-#1605, so far I am able to compile my own qemu-8.0 with debuginfo symbols using mingw and do some very basic troubleshooting using gdb, like putting assert()/printf() statements and inspecting stack frame variables.
How did you troubleshoot this? How did you know qemu was stuck on WriteFile()?
That's a very interesting observation about the cache dependency. I'd be curious to see what happens when connecting to qemu-storage-daemon
from another host/vm (either Linux or Windows), thus avoiding this dependency.
So the idea is that WNBD disk writes are cached and so are the underlying qemu image file writes. This might lead to a deadlock when Windows attempts to flush its cache. I'm not really familiar with the Windows cache manager internals, the guys from OSR (https://community.osr.com) might be able to provide more insight (some Microsoft engineers used to post there). However, it's highly likely that writes are blocked while there's a pending cache flush, despite targeting a different disk.
If this situation is confirmed, we could document something along those lines:
When using WNBD to connect to a local NBD server, make sure to disable
caching either at the WNBD disk level or at the NBD server level. Windows cache
flushes may otherwise deadlock.
For example, when using ``qemu-storage-daemon.exe`` locally, make sure to pass
``cache.direct=on`` like so:
qemu-storage-daemon.exe \
--blockdev driver=file,node-name=file,filename=e:\img\test_100gb.raw,cache.direct=on \
--blockdev driver=raw,node-name=raw,file=file \
--nbd-server addr.type=inet,addr.host=127.0.0.1,addr.port=10809 \
--export type=nbd,id=export,node-name=raw,name=test_100gb,writable=on
It's also possible that, if the nbd-client part of wnbd were jettisoned into userspace, the OS/kernel would not be insistent that wnbd respond immediately or within a short time bound, allowing for a slower or delayed response from the emulated drive.
I've been testing using the userspace nbd client proposed here: https://github.com/cloudbase/wnbd/pull/122. The behavior is the same.
Q) Another question, to learn from you: requesting you to document what you did and what tools you used. As seen in bug qemu-#1605, so far I am able to compile my own qemu-8.0 with debuginfo symbols using mingw and do some very basic troubleshooting using gdb, like putting assert()/printf() statements and inspecting stack frame variables. How did you troubleshoot this? How did you know qemu was stuck on WriteFile()?
I did mostly the same. I was initially considering using Ubuntu to cross compile Qemu for Windows, however it was easier to go with msys2 since it already had all the dependencies (e.g. glib).
Then I checked the nbd server code, looking for the main entry points for NBD IO operations. I was a bit familiar with the qemu block drivers since I wrote a few patches for vhd(x) about 10 years ago :). I knew that raw image file writes should eventually hit this: https://github.com/qemu/qemu/blob/d27e7c359330ba7020bdbed7ed2316cb4cf6ffc1/block/file-win32.c#L86.
I added a few log messages to track IO requests more easily: https://github.com/petrutlucian94/qemu/commit/b18da91174c73cebec8c1a2e8a13726f7c6687c4. Probably I could've just used qemu tracing, however I haven't used that in a while and it was easier to just add a few print messages :). I was running wnbd in debug mode and checking the last 10 requests or so before the deadlock occurred. I was mostly ensuring that the write address, length and nbd tag match the ones received by the nbd server.
Sure, a stack trace would've also told me about the WriteFile hang, however I was also looking for unusual IO patterns. When building qemu using mingw-gcc, the debug symbols use dwarf format, which can't be used by WinDBG. I usually use this gdb fork for opening Windows dumps (the upstream version doesn't support this feature): https://github.com/ssbssa/gdb/releases/tag/gdb-13.1.90.20230220.
Another option might be to compile qemu using mingw-llvm, which has .pdb symbol support, which can be handled by WinDBG.
In comment 1551603847, I had this idea of using a script to detect drive-lock quickly and visually, without manual checking.
PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1}" -f $(date),$(dir F:\gkpics01.qcow2).Length ; sleep 6 ; $i = $i + 1 }
PS C:\Windows\system32> $i=0 ; while ($true -and ($i -lt 400)) { "{0} {1} {2}" -f $(date),$(dir F:\gkpics01.qcow2).Length,$(dir F:\gkpics01.qcow2).LastWriteTime[0].ToString() ; "test" | Out-File -FilePath K:\tmpq\Downloads\a.txt ; sleep 6 ; $i = $i + 1 }
But with this, I could not reproduce the stuck-state; it was acting like a keep-alive.
It feels like it is less likely for the stuck-state to happen (it shows up much later into the run) when I just stay glued to the same desktop-screen with the powershell windows and do nothing else but wait for xcopy or notepad to become stuck, hoping to see a stuck script as well. Either the stuck-state happens but the script does not stop, or the stuck-state never happens.
It feels like it is more likely for the stuck-state to happen (it shows up sooner into the run) when I have multi-tasked to a Firefox window on a different desktop-screen (win10 multiple-desktop feature), which is what I did most of the time. After a while, as one browses, drive-lock happens. When a new weblink/tab is opened, the Firefox window freezes/becomes unresponsive. Perhaps using something like Firefox creates contention for the file-buffer cache. One might take longer to notice the stuck-state if one only context-switches between browser tabs for which Firefox already has the page artifacts in memory.
It's a good thing that even in the stuck-state/drive-lockup, when things like the start-menu and taskbar might freeze, one can switch between desktops using the Ctrl-Win-Left and Ctrl-Win-Right keyboard shortcuts. (Rapid switching between desktop screens does not increase the chances of the bug.)
So file-activity on the host may have an influence on this bug.
These hunches I get could be superstitions.
That something external to both qsd and wnbd has this effect makes me think. This could be a bug involving file-buffering inside the Win10 kernel itself.
writes are blocked while there's a pending cache flush, despite targeting a different disk
A cache flush of files on the top-layer storage (J:) becomes block-writes into a vdisk-file that are also cached, the vdisk-file being located on a lower layer (F:).
The Windows File Cache-Manager manages Caching per file object. ref: learn.ms/file-caching
Maybe the filesystem file-buffer for F: caches a small number of blocks of the single large vdisk file (93 GiB), perhaps the blocks from the last few WriteFile()/ReadFile() accesses.
The qsd workaround (cache.direct=on) works because, when J: files are being flushed, the qemu WriteFile() to F: no longer requires interception and a decision by the Windows Cache Manager.
The xcopy /J workaround works because there is no J:-layer file-buffer cache to flush that is intense enough to cause F: writes to block.
Write-VolumeCache (learn.ms/write-volumecache) and sync F J (sysinternals/sync) are used to manually commit the file-buffer to disk; an admin usually invokes them before unmounting/detaching a volume/disk. The FlushFileBuffers() function takes a file handle as its argument (learn.ms/flushing-buffered-io, learn.ms/FlushFileBuffers).
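For illustration, a minimal C sketch of flushing a volume's buffers with FlushFileBuffers() (my own example, not from wnbd or qemu; this is presumably what the tools above boil down to, and the volume path "\\.\F:" is just an example):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Opening a volume handle like "\\.\F:" requires admin rights;
     * any writable file handle also works for a per-file flush. */
    HANDLE h = CreateFileA("\\\\.\\F:", GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "open failed: %lu\n", GetLastError());
        return 1;
    }
    if (!FlushFileBuffers(h)) {   /* blocks until the cached data is on disk */
        fprintf(stderr, "flush failed: %lu\n", GetLastError());
    }
    CloseHandle(h);
    return 0;
}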
I have not yet had a test run where the workarounds (cache.direct=on, or xcopy /J with the default cache.direct=off) did not work. That's good. Though reproducibility without the workaround is probabilistic, the reason I may have had more certainty of running into it is that I used to let this run in the background while doing other things, including browsing with Firefox. In the 85 GiB copy, most hangs happen within the first 13 GiB copied (about 5 min); I get a bit antsy if xcopy passes that mark. But I've also had a test run where the stuck-state happened near the end of the xcopy.
In the event success happens, one needs to repeat the test a few times, in order to get a statistical feel so as to differentiate between the following cases.
I had a simple loop that was just copying files and it used to hang after about 5 minutes. After disabling the cache on the nbd server side, it transferred a few TBs overnight without problems.
The best thing we can do right now is to document this, I'll send a patch. Another option would be to forward LUN resets to the nbd daemon, however that would still be less than ideal: having IO stuck every 5 minutes for 30s (under load), until Storport issues a reset.
By the way, I've merged the PR that moves the NBD daemon to userspace (libwnbd): https://github.com/cloudbase/wnbd/pull/122.
LE: I've documented it here: https://github.com/cloudbase/wnbd/pull/123.
After documenting the NBD caching issue and implementing stale connection detection (https://github.com/cloudbase/wnbd/pull/126), I think we can close this issue.
Yes, go ahead.
This is the corresponding issue on the qemu side, just for reference.
https://gitlab.com/qemu-project/qemu/-/issues/814
It contains only occasional summaries as things cleared up.
I may need to update its description, as many confounding issues have been cleared up since the time of first filing.
In a direct email communication that I had with the qemu block-code developers, some points came up:
I was doing some testing on WIN10-21H2-19044-1415/WNBD-0.2.2-4-g10c1fbe qemu-6.2.0-rc4/ExFAT/qcow2/NTFS
For complete details, see https://gitlab.com/qemu-project/qemu/-/issues/727#note_780895802
On Linux, only the VHDX format corrupts during expansion, while qcow2 and other virtual-disk formats seem fine. But on Windows, I found that even the qcow2 format can corrupt during expansion (TEST E). For this reason, I think this is a separate, distinct bug from the original, and wnbd may also be a factor. The Windows-only bug exists in addition to the qemu-only bug, which is also lurking in there.
The bug is reproducible. A script for synthetic data generation is provided in an earlier comment in that issue filing (link).
It is possible that this is a qemu-nbd-Windows-only bug, but then the wnbd driver is also involved here. The semi-official driver was test-signed and seemed to be still under development. I guess the bug needs to be squashed by troubleshooting and closing in on it from both sides.
The 5 test-cases can also be part of wnbd testing.
Q) Is there are recommended version of qemu-nbd that is known to work reliably with WNBD and the two together have been well tested?