the-debris closed this issue 4 years ago
Hi @the-debris ,
Do you also have corrupted files when making a copy from your shared FUSE to a local drive?
Looking at the mirror log, the offsets do not even follow each other:
###Read 0238
ReadFile : C:\Users\user\test_mirror\orig.bin
Byte to read: 32768, Byte read 32768, offset 0
###Read 0238
ReadFile : C:\Users\user\test_mirror\orig.bin
Byte to read: 32768, Byte read 30856, offset 18649088
###Read 0238
ReadFile : C:\Users\user\test_mirror\orig.bin
Byte to read: 32768, Byte read 32768, offset 17625088
###Read 0238
ReadFile : C:\Users\user\test_mirror\orig.bin
Byte to read: 32768, Byte read 32768, offset 12038144
###Read 0238
ReadFile : C:\Users\user\test_mirror\orig.bin
Byte to read: 32768, Byte read 32768, offset 18456576
###Read 0238
ReadFile : C:\Users\user\test_mirror\orig.bin
Byte to read: 32768, Byte read 32768, offset 12079104
Could you use Procmon and see what offsets are used? At least to see whether they are correct.
(The debug_view.LOG logs only come from the network redirector; to get kernel logs, a debug build needs to be used.)
Hi Liryna,
Thanks for replying so quickly!
Yes, I observe similar data corruption when I copy from my shared FUSE.
As you suggested, I used ProcMon to capture the file operations and compared the results obtained by copying from M: and from \\localhost\M.
Windows seems to perform 2 steps in both cases:
1) At least one sequence of 32 kB read operations with odd offsets. I can only see that it starts near the end of the file. There are no write operations at this point. This sequence can occur more than once, with exactly the same offsets each time. More importantly, this sequence is identical whether you copy from M: or from \\localhost\M. I believe these are the offsets you pointed out; they are most likely internal Windows bookkeeping.
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 0, Length: 32 768, Priority: Normal
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 18 649 088, Length: 30 856
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 17 625 088, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 12 038 144, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 18 456 576, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 12 079 104, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 15 216 640, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 16 269 312, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 16 441 344, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 15 269 888, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 18 558 976, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 17 768 448, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 17 735 680, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 12 005 376, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 16 408 576, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 16 338 944, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 16 474 112, Length: 32 768
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 16 506 880, Length: 32 768
2) A sequence of 1 MB read operations, each followed by a write operation on my destination path. This is obviously the copy itself, and there are interesting differences: the copy from M: follows a logical order, while the copy from \\localhost\M uses the correct offsets but swaps/skips/duplicates them.
## From M:
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 0, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 1 048 576, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 2 097 152, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 3 145 728, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 4 194 304, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 5 242 880, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 6 291 456, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 7 340 032, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 8 388 608, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 9 437 184, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 10 485 760, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 11 534 336, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 12 582 912, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 13 631 488, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 14 680 064, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 15 728 640, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 16 777 216, Length: 1 048 576
mirror.exe|3924|ReadFile|file.bin|SUCCESS|Offset: 17 825 792, Length: 854 152
## From \\localhost\M
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 0, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 3 145 728, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 2 097 152, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 1 048 576, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 5 242 880, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 6 291 456, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 7 340 032, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 8 388 608, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 9 437 184, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 9 437 184, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 7 340 032, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 11 534 336, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 12 582 912, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 14 680 064, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 15 728 640, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 13 631 488, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 16 777 216, Length: 1 048 576
mirror.exe|9924|ReadFile|file.bin|SUCCESS|Offset: 17 825 792, Length: 854 152
The mirror logs are consistent with these observations (search for "Byte to read:").
This time, I actually captured the DokanFS kernel logs (sorry about last time). I hope they'll be helpful. shared_drive_debugview.log
Hi @the-debris ,
I am having difficulty reproducing this. Does it also happen without any sharing, with a simple local mirror? (Please share the mirror commands.) Can you display large images like https://sample-videos.com/img/Sample-jpg-image-30mb.jpg ?
(The kernel logs are without the read/write 😢 we should see the offset values in them)
Hi @Liryna
Here is my detailed procedure:
1) I execute the following command: mirror.exe /r C:\Users\user\test_mirror /l M: (so the basic example).
2) I copy file.bin from M: to wherever on C: and it works fine
3) Then I go to Computer -> M: (which appears as a local drive): Right click -> Properties -> Sharing -> Advanced sharing -> Enable sharing (for all users) as M
4) Same as 2) : it works fine again
5) I copy file.bin from \\localhost\M to wherever on C: and the resulting file is corrupted
6) On another machine, I copy file.bin from \\<1st_computer_IP>\M to wherever on C: and the resulting file is corrupted
I tried mirror.exe /m /r C:\Users\user\test_mirror /l M:
with exactly the same results as above (except M: appears as removable).
The results were identical when I tried to copy an image (.png/.jpg): OK from the local drive; KO from the network. (Since I couldn't access your website, I used this satellite image (39 MB).)
However, when I open the .jpg file from \\localhost\M, there doesn't seem to be any corruption. I also converted it to .png to make sure the photo viewer would immediately read the entire file: still no visible problem. The only difference I can see in the Procmon logs is that the W10 photo viewer reads 32 kB blocks to display the image, while explorer.exe reads 1 MB blocks to copy the file. Maybe I'll write a simple copy application with different buffer sizes to see if there's a connection.
(The kernel logs are without the read/write cry we should see the offset value in it)
I don't understand: have I messed up again or are these logs not available?
@the-debris OK thanks for all the details. I see more what's happening and what's the difference in your test.
I will try to reproduce as soon as I can.
Hi @the-debris
I tried to reproduce the issue on my VM: mount the mirror, share the new storage, access it via \\localhost\M, copy the image to the desktop folder. Procmon shows 1 MB for each read and sequential offsets. Am I missing something?
Are you doing your test on a clean env? Your idea of having a simple copy application could be interesting, especially to reproduce the "shared" copy on the local device.
Hi @Liryna
No, I don't think you're missing anything. In my case, I checked the ReadFile ops from mirror.exe instead of explorer.exe. But it's true I didn't check whether the explorer.exe and mirror.exe offsets were consistent. Did you compare the resulting file with the original? (I use BeyondCompare to do a hexadecimal comparison, but a simple SHA comparison is enough.)
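A hexadecimal comparison like the one mentioned above boils down to finding the first byte offset where two files diverge. A minimal sketch in C (the helper name `first_diff_offset` is hypothetical, not part of BeyondCompare or any tool mentioned here):

```c
#include <stdio.h>

/* Hypothetical helper: return the byte offset of the first difference
 * between two open streams, or -1 if they are identical to the end.
 * A length mismatch shows up as a difference at the shorter file's end,
 * because fgetc returns EOF there while the other stream still has data. */
long first_diff_offset(FILE *a, FILE *b) {
    long off = 0;
    for (;;) {
        int ca = fgetc(a);
        int cb = fgetc(b);
        if (ca != cb)
            return off;   /* mismatch, or one file ended early */
        if (ca == EOF)
            return -1;    /* both ended together: identical */
        off++;
    }
}
```

On the corrupted copies described in this thread, the returned offset would land on a 1 MB boundary, matching the block-level corruption pattern.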
I realize I forgot to tell you that this issue is not systematic with a jpg, but you can reproduce it at least once if you make several copies (I know I did, in which case the corruption is clearly visible when you open the photo). The odds are better if you convert it to png first (I don't really understand why). On the other hand, it is systematic with a bin/dll (cf. orig.bin in my first post): any random Microsoft DLL larger than 10 MB should do the trick (orig.bin was renamed from node.dll, which comes from the portable version of Visual Studio Code). I am under the impression that Windows first analyzes the content of the file and adapts its processing. That would be the role of the "first step" I mentioned in my first reply, which differs between a jpg and a dll according to Procmon.
Your idea of having a simple copy application could be interesting. Specially to reproduce the "shared" copy on the local device.
As a matter of fact, I did begin to implement that simple copy program. Actually, 3 versions of it:
The first two versions do not reproduce the issue, no matter what buffer size I allocate (which reminds me, I didn't check what buffer size fread uses internally). But the CopyFile version does reproduce the problem. It got me wondering whether overlapped I/O is involved in Microsoft's implementation. I actually wanted to investigate that lead before getting back to you.
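For reference, the fread/fwrite test version presumably looked something like the following sketch (the name `copy_stream` and the structure are assumptions, not the actual test code). Because it is strictly sequential and synchronous, it never depends on completion ordering, which would explain why it does not reproduce the corruption:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical synchronous buffered copy, roughly what a fread/fwrite
 * test version would do. buf_size can be varied to probe the issue.
 * Returns the number of bytes copied, or -1 on allocation/I/O error. */
long copy_stream(FILE *src, FILE *dst, size_t buf_size) {
    char *buf = malloc(buf_size);
    if (!buf)
        return -1;
    long total = 0;
    size_t n;
    while ((n = fread(buf, 1, buf_size, src)) > 0) {
        if (fwrite(buf, 1, n, dst) != n) {  /* partial write: abort */
            free(buf);
            return -1;
        }
        total += (long)n;
    }
    free(buf);
    return ferror(src) ? -1 : total;
}
```

CopyFile, by contrast, is free to issue multiple outstanding reads internally, which is where the ordering question raised above comes in.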
Are you doing your test on a clean env ?
If you're asking whether I just installed the OS, the answer is no. I could try to work with a clean VM as you do to see if there's an improvement. However, this issue occurred on several computers (on W7 and W10).
Anyway thank you for taking time to help me :)
I did use SHA-256, but to be sure I also ran BeyondCompare. Comparing a DLL from system32 with the shared folder directly, and then with a file copied from the shared folder to local storage: they are all the same.
@the-debris From the mirror logs you provided, I think the best would be to first look at the explorer/paint/... offsets and lengths and, if they are correct, to look at the kernel logs (it takes a long time to have everything printed, so you might have stopped recording too early) https://github.com/dokan-dev/dokany/wiki/How-to-Debug-Dokan#get-logs
Try to get from the logs this output: https://github.com/dokan-dev/dokany/blob/master/sys/read.c#L131
With this we will know at which state the offset is corrupt.
Hi @Liryna Indeed the problem doesn't occur when mirror is executed from a brand new Windows 10 VM. I'll try to generate the logs you refer to from my environment. I will also look for a common denominator between the several environments where I observed the problem. I'll get back to you as soon as I have more info.
Hi @Liryna
I did some tests again on my configuration (same old copy from Explorer). When I copy from \\localhost\M, this is what I observe in Procmon (with a cleverer filter): as you can see, explorer.exe uses the right offsets (on \\localhost\M\WindowsCodecsRaw.dll as well as on M:\WindowsCodecsRaw.dll). Then mirror.exe uses a wrong sequence of offsets, with some delay. In the Dokan logs, the offsets seem to be correctly transmitted from explorer.exe (timestamps are consistent):
00087203 26.57560158 [DokanFS] ==> DokanRead
00087204 26.57560539 [DokanFS] ProcessId 7364
00087205 26.57561111 [DokanFS] FileName: WindowsCodecsRaw.dll FCB.FileName: \WindowsCodecsRaw.dll
00087206 26.57561493 [DokanFS] ByteCount:1048576 ByteOffset:0
...
00087230 26.57618713 [DokanFS] ByteCount:1048576 ByteOffset:1048576
00087241 26.57673645 [DokanFS] ByteCount:1048576 ByteOffset:2097152
00087252 26.57725716 [DokanFS] ByteCount:1048576 ByteOffset:3145728
00087263 26.57781601 [DokanFS] ByteCount:1048576 ByteOffset:4194304
00087274 26.57837868 [DokanFS] ByteCount:1048576 ByteOffset:5242880
00087285 26.57892036 [DokanFS] ByteCount:1048576 ByteOffset:6291456
00087306 26.57946777 [DokanFS] ByteCount:1048576 ByteOffset:7340032
00087355 26.58164978 [DokanFS] ByteCount:1048576 ByteOffset:8388608
00087406 26.58510208 [DokanFS] ByteCount:1048576 ByteOffset:9437184
00087424 26.59011459 [DokanFS] ByteCount:1048576 ByteOffset:10485760
00087581 26.60300827 [DokanFS] ByteCount:1048576 ByteOffset:11534336
00087592 26.60322380 [DokanFS] ByteCount:1048576 ByteOffset:12582912
00087603 26.60338593 [DokanFS] ByteCount:1048576 ByteOffset:13631488
00087654 26.60891533 [DokanFS] ByteCount:1048576 ByteOffset:14680064
00087768 26.61700249 [DokanFS] ByteCount:1048576 ByteOffset:15728640
00087795 26.61784363 [DokanFS] ByteCount:1048576 ByteOffset:16777216
00087897 26.62084389 [DokanFS] ByteCount:1048576 ByteOffset:17825792
00087958 26.62479591 [DokanFS] ByteCount:1048576 ByteOffset:18874368
00087982 26.62604332 [DokanFS] ByteCount:1048576 ByteOffset:19922944
00088079 26.62792969 [DokanFS] ByteCount:1048576 ByteOffset:20971520
00088199 26.63558960 [DokanFS] ByteCount:1048576 ByteOffset:22020096
00088219 26.63583374 [DokanFS] ByteCount:1048576 ByteOffset:23068672
00088364 26.64263153 [DokanFS] ByteCount:1048576 ByteOffset:24117248
00088383 26.64284706 [DokanFS] ByteCount:1048576 ByteOffset:25165824
00088402 26.64301872 [DokanFS] ByteCount:1048576 ByteOffset:26214400
00088485 26.64669991 [DokanFS] ByteCount:1048576 ByteOffset:27262976
00088496 26.64684105 [DokanFS] ByteCount:1048576 ByteOffset:28311552
00088733 26.66102219 [DokanFS] ByteCount:1048576 ByteOffset:29360128
00088747 26.66132164 [DokanFS] ByteCount:1048576 ByteOffset:30408704
00088762 26.66156387 [DokanFS] ByteCount:1048576 ByteOffset:31457280
00088775 26.66384315 [DokanFS] ByteCount:1048576 ByteOffset:32505856
Furthermore, the mirror.exe operations are better intertwined with the explorer.exe ones when I copy from M:
I'm still trying to find out what could be different between my env and a clean VM. I'm currently investigating the firewall and the network configuration (although the traffic is not blocked, only disturbed...).
Hmm, interesting. You might have a filter driver that corrupts the offset. Can you use fltmc to list them and compare with your VM? Maybe install the software little by little and find out which filter driver is faulty. https://blogs.msdn.microsoft.com/ntdebugging/2013/03/25/understanding-file-system-minifilter-and-legacy-filter-load-order/
Otherwise, are you using the mirror x86 or x64 version on your VM and your env?
Hi @Liryna
There was not much difference between my VM and my env: only FsDepends was present on my env (with a normal altitude according to MS documentation) but not on my VM. I am, however, able to start it on my VM via SC, and it doesn't help reproduce the issue. For now, I am only working with x64 OS and software. Unfortunately, I haven't made any progress in the search for the faulty driver (it's a long process...). I hope I can get back to it this week.
Hi @Liryna ,
Unfortunately, I've been quite busy these last months and haven't had time for further investigation until now.
I didn't have much luck in searching for the way to reproduce the issue on a VM. Instead, I took the liberty to add my own logs in order to have a better understanding of what goes wrong and I believe I finally nailed it.
I attached to this post:
The _local versions are only here as a comparison point - the issue is once again visible in the _shared files. The points of interest are:
The problem seems to be located in the thread pool used in dokan/DokanMain. While the IOCTL_EVENT_WAIT requests are triggered in the right sequence by the different threads, the reply time varies on each thread, so the read IRPs are not processed in the same order as they entered user space. If I set my mirror.exe as single-threaded, the shared file is read correctly. However, this fix is not very satisfying from a performance point of view, as you can imagine. In any case, the user-space workload distribution has no notion of anteriority, which is problematic when processing a group of ordered operations.
I see several solutions:
Another way would be to figure out why a network latency causes a delay in the IOCTL_EVENT_WAIT processing. No luck for now.
In any case, I look forward to hear your opinion.
Hi @Liryna!
I see several solutions:
- force a thread B to wait for any previous thread A to complete its IOCTL_EVENT_WAIT --> that may fix the problem, but I don't believe it to be a fool-proof solution at all, as it still does not guarantee that thread A will finish its task before thread B
- force a thread B to wait for any previous thread A to complete its IOCTL_EVENT_INFO --> this guarantees that the kernel will receive the IRPs back in the right sequence, but the chunks of the file could still have been read in the wrong sequence, which is not great performance-wise
- force every operation belonging to the same "global action" to execute in sequence --> the best solution in my opinion, but probably the most difficult to implement... On which criteria could this be applied: same MJ and same target object name? And how much would such a verification cost?
Actually I've been thinking. The 3rd solution, which I presented as the best, cannot work at all. I was so focused I couldn't see the obvious: when a thread knows what operation to process, it is already too late.
Instead, I would like to propose the following (which would be a mix between solutions 1 and 3 with a more significant architectural modification):
Hi @the-debris ,
It is normal for IRP events to come unordered, and even for them to complete in the kernel unordered. The difference depends on whether the reads coming from userland are:
Here you have multiple reads at the same time, so I would expect the userland app is doing multiple async reads to speed things up, and it should concatenate the results correctly, which might not be the case. Or Dokan is mixing data during this async handling, but it is hard to see how.
For information, Windows drivers are designed to be async. Even when a sync read is triggered, that does not mean our driver has to synchronize the request; that is the I/O Manager's task.
I hope this was clear, does it make sense ?
Hi @Liryna As a matter of fact, it makes a lot of sense!
It is normal to have IRP's event coming unordered
That I understood
Even that they complete in the kernel unordered
That is precisely what I expected, and I couldn't figure out why it wouldn't work. The thing is, it does work properly: I confirmed it by printing the first few bytes resulting from each read IRP.
sequential then we would not have any issue of corruption whatever the speed of completion
True. As I stated in a previous message, everything works fine when I use my own synchronous copy routine.
async, the userland app has to support non-ordered completion and build/concat the result is the good order
And you put your finger on the cause of the issue: the mirror program (as well as my own) gets mixed up with its seek/read calls and doesn't read where it thinks it does. I also witnessed that by printing the read buffer.
I realize I have been tackling this problem from the wrong end. I thought the user side read the correct chunks while the kernel side reassembled them in the wrong sequence, and it turns out it was exactly the opposite... At least it has been an occasion to familiarize myself (a little bit) with the Windows kernel. Thanks a lot for your explanations.
Now I will maybe experiment on the mirror and rework my application to make it robust to asynchronous calls. Thank you again!
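The failure mode and the fix discussed in this thread can be illustrated with a toy model (all names here are hypothetical, not Dokan or mirror code): when read completions arrive out of order, appending them in arrival order corrupts the output, while writing each chunk at the offset it was read from reconstructs the file correctly.

```c
#include <string.h>

/* Toy model of an async read completion: a chunk of data plus the
 * file offset it was read from. */
typedef struct {
    long offset;       /* where the chunk belongs in the file */
    const char *data;  /* the bytes that were read            */
    size_t len;        /* number of bytes in this chunk       */
} completion_t;

/* Broken strategy: append completions in arrival order, as a copy
 * loop that ignores completion offsets would do. */
void append_in_arrival_order(char *out, const completion_t *c, int n) {
    size_t pos = 0;
    for (int i = 0; i < n; i++) {
        memcpy(out + pos, c[i].data, c[i].len);
        pos += c[i].len;
    }
}

/* Correct strategy: place each completion at its own offset, in
 * whatever order the completions happen to arrive. */
void place_by_offset(char *out, const completion_t *c, int n) {
    for (int i = 0; i < n; i++)
        memcpy(out + c[i].offset, c[i].data, c[i].len);
}
```

With two 5-byte chunks arriving in reversed order, `place_by_offset` yields the original data while `append_in_arrival_order` yields the blocks swapped, which is exactly the 1 MB block-swap pattern observed in the corrupted copies.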
Glad you found out! It is always a pleasure to help.
Can this be closed ?
I believe you can. However I'd appreciate it if you could grant me 24h to be 100% sure I don't run into another snag while implementing my fix ;)
@the-debris OFC :) let me know when you feel confident about it
@Liryna All good. Closing the issue now.
Not that anybody asked me, but I got a very similar problem with Kaspersky Security 10 for Windows Server 10.0.0.468 and its KLIF filter on Windows 2008 R2 SP1 x64 when the source folder is a UNC SMB share (Debian/ext4, read-only).
Adding the key /t 1 to mirror.exe helped!
I tried increasing the number of threads first, and many other things, but this is the only thing that helped.
@kay27 Thanks for the feedback!
So that would mean it is the reads/writes over the network made by the mirror that are the issue. Note: mirror was not built for that, and if we want to support this case, we should probably look into implementing the oplock calls in the mirror.
Note: mirror was not built for that and if we want to support this case we should probably look into implementing the oplock calls in the mirror.
I do realize that, thanks for reminding me; it was just the only way to re-share the files from a dedicated mirroring SMB server. The built-in sharing doesn't support junctions or symlinks.
So that would mean it is the read / write over the network made by the mirror that are the issue.
Maybe, yes. But I'm not sure Kaspersky solutions are widely used... The only error I saw in the debug log related to copying files is:
###QueryVolumeInfo 0016
GetVolumeInformation: unable to query underlying fs, using defaults. Last error = 123
All blocks were 1 MB; latest build from AppVeyor.
Environment
Check List
Description
Hi,
I tried to share a FUSE volume by activating Windows Properties/Sharing/Advanced sharing (with full control granted to everyone). When accessing it from the network (on the same machine \\127.0.0.1\ or on another machine on the network \\\), I can properly list the files present on the volume and their properties.
However, I keep retrieving corrupted data when I read from those files. An easy way to observe the problem is to copy the same file twice to another location and to compare the copies (with the original and one another). I attached an example of such files. On the binary level, it seems that 1 MB blocks (the size of the transfer buffer, I guess) are randomly either deleted or inserted. (I haven't been able to determine where the inserted data comes from) Consequently, this problem rarely occurs with small files but becomes systematic with large files (> 10 MB).
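The "blocks deleted or inserted" pattern described above can be spotted by comparing two copies block by block. A minimal sketch, assuming equal-length buffers and a hypothetical helper name (`block_size` would be 1 MB for the transfers in this report):

```c
#include <string.h>

/* Count how many fixed-size blocks differ between two equal-length
 * buffers. A block count > 0 with matching total length points at the
 * block swap/duplication pattern rather than truncation. */
int count_bad_blocks(const char *a, const char *b,
                     size_t len, size_t block_size) {
    int bad = 0;
    for (size_t off = 0; off < len; off += block_size) {
        /* the final block may be shorter than block_size */
        size_t n = (len - off < block_size) ? len - off : block_size;
        if (memcmp(a + off, b + off, n) != 0)
            bad++;
    }
    return bad;
}
```

This also explains why small files rarely show the problem: a file shorter than one block is transferred as a single chunk, so there is no ordering to get wrong.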
I applied the same procedure to the Dokan mirror.exe sample which is deployed with the 1.3.0.1000 debug redistributables. I got the same odd result. Since the Cygwin DLLs are not deployed on my PC, the FUSE mirror.exe could unfortunately not launch. I can test that too if you believe it's necessary. But if I understand correctly, Dokan relies on FUSE, so I think there is a problem with the FUSE read routine when Windows file sharing is involved.
Has anyone encountered this problem before, or did I apply a wrong procedure? Thank you in advance for your reply :) B.
Logs
mirror.log copied_file_example.zip debug_view.LOG (is DokanFS [10120]?)