buildbarn / bb-storage

Storage daemon, capable of storing data for the Remote Execution protocol
Apache License 2.0

Fix Windows Remove and Link. #151

Closed kschzt closed 1 year ago

kschzt commented 1 year ago

I've got builder and runner working on Windows, and found two bugs:

Attn @mou-hao for the link one. I had lots of conflicts ("file is being used"-style errors) with your solution, whereas os.Link() doesn't error out. I'm not sure exactly what the issue was. Perhaps symlinks could be made the same way?

mou-hao commented 1 year ago

Did you try using os.ReadDir() while keeping the hardlink implementation? I am curious whether the problem is only in readdirnames() or whether there's a problem with the hardlink implementation as well.

kschzt commented 1 year ago

@mou-hao I did. There are two separate issues that I found. Hardlinks work with os.Link(), but run into conflicts with the original implementation. Unfortunately I couldn't find the root cause there, but it refused to create them because the file was "busy".

kschzt commented 1 year ago

@mou-hao I'm having to revisit this. I initially developed this on a Win11 Pro desktop, and of course the ACL situation is totally different on Windows Server 2019, where I'm running it. I started getting "resource busy" errors again, and permission denied errors when cleaning up. I've gone back to your deletion routines with readdirnames() patched, and it can clean up. Your hardlink routine also seems to work just as well as os.Link(). On the Windows Server 2019 system it just hits the hardlink limit quickly for some reason (64 workers and cores), in which case I fall back to just doing an io.Copy().

For some reason though, the throughput is very poor. I'm not sure exactly why, whether it's just NTFS, but I'm getting 2 actions/s on 64 cores... any ideas? Seems like it's heavily locked up or something.

So let's keep this open for now, or close and reopen later.

mou-hao commented 1 year ago

Yeah this! You would need to have developer mode on or run as the administrator to use some of the file system features such as symbolic links and POSIX-style delete behavior (delete while other processes are still using the file etc.). I originally tested these on a Windows 10 Pro machine. I’m not sure if Windows 11 works any differently. I will try to recreate the readdirnames() issue and see if I can fix it without reverting to string-based file system operations when I get the time.
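One way to find out up front whether a given machine allows symlink creation is to probe for it at startup. A minimal sketch, assuming os.Symlink() is the creation path being used; canCreateSymlinks is a hypothetical helper, and it does not check the POSIX-style delete behavior.

```go
package main

import (
	"os"
	"path/filepath"
)

// canCreateSymlinks reports whether the current process is able to create
// a symbolic link inside a fresh temporary directory. The error result is
// only non-nil when the probe itself could not be set up.
func canCreateSymlinks() (bool, error) {
	dir, err := os.MkdirTemp("", "symlink-probe")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "target")
	if err := os.WriteFile(target, nil, 0o600); err != nil {
		return false, err
	}
	if err := os.Symlink(target, filepath.Join(dir, "link")); err != nil {
		// Creation failed, for example because the privilege is missing.
		return false, nil
	}
	return true, nil
}
```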

After that’s fixed, we still have the hardlink limit issue. That means we will need a softlink based solution for Windows? Or will we be better off using FUSE/NFS?

Another issue I’ve noticed when using MSVC with Bazel/Buildbarn that might be of interest to you is that MSVC itself does not seem to handle concurrency that well. When I set runner concurrency to anything larger than 1, I see lock-ups on pdb files. I tried this with Bazel 5.2 and VS 2019 though. Not sure if it’s still the case.

kschzt commented 1 year ago

Closed this for now; let's approach the readdirnames() issue separately.

kkpattern commented 1 year ago

https://learn.microsoft.com/en-us/windows/win32/projfs/projected-file-system

Maybe we can use ProjFS to create a virtual file system on Windows?

kschzt commented 1 year ago

Something that would at least partially bypass NTFS + ACLs would be great. ProjFS backed by CAS directly? 🤔 Not sure how much can be bypassed though and whether it would still be slow (because NTFS)?

kkpattern commented 1 year ago

Something that would at least partially bypass NTFS + ACLs would be great. ProjFS backed by CAS directly? 🤔 Not sure how much can be bypassed though and whether it would still be slow (because NTFS)?

Maybe we can back ProjFS with the local file system? When a file is needed, download it from the CAS, write it to the local file system, then project it into the virtual root. We can set a max cache size and use some eviction algorithm to remove outdated files. ProjFS is needed so we can simulate hard links in the worker process.
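To make the "max cache size plus eviction" part concrete, here is a small sketch of the bookkeeping only, using least-recently-used eviction as one possible algorithm. It does not touch ProjFS or the CAS at all; lruIndex, cacheEntry and the digest strings are hypothetical names for illustration.

```go
package main

import "container/list"

// cacheEntry describes one blob stored in the local file system cache.
type cacheEntry struct {
	digest string
	size   int64
}

// lruIndex tracks which blobs are on local disk and which to evict next.
type lruIndex struct {
	maxBytes int64
	curBytes int64
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // digest -> element in order
}

func newLRUIndex(maxBytes int64) *lruIndex {
	return &lruIndex{
		maxBytes: maxBytes,
		order:    list.New(),
		items:    map[string]*list.Element{},
	}
}

// Touch marks digest as recently used and reports whether it is cached.
func (c *lruIndex) Touch(digest string) bool {
	el, ok := c.items[digest]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

// Add records a newly downloaded blob and returns the digests whose files
// should be deleted from disk to get back under maxBytes. The blob that
// was just added is never evicted, even if it alone exceeds the limit.
func (c *lruIndex) Add(digest string, size int64) []string {
	if c.Touch(digest) {
		return nil
	}
	c.items[digest] = c.order.PushFront(cacheEntry{digest: digest, size: size})
	c.curBytes += size

	var evicted []string
	for c.curBytes > c.maxBytes && c.order.Len() > 1 {
		e := c.order.Remove(c.order.Back()).(cacheEntry)
		delete(c.items, e.digest)
		c.curBytes -= e.size
		evicted = append(evicted, e.digest)
	}
	return evicted
}
```

A worker would call Touch when a projected file is opened and Add after a download completes, then delete the returned digests from the backing directory. A real version would also have to avoid evicting files that a running action still has open.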