dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Should RegisteredWaitHandle always use Win32 TP? #90866

Open fbrosseau opened 1 year ago

fbrosseau commented 1 year ago

Hello,

A known limitation, and a significant design choice, in Windows is the hardcoded limit of 64 (MAXIMUM_WAIT_OBJECTS) waitable objects in all kernel wait primitives (WaitForMultipleObjects and all its derivatives). This limitation will never change. Waitable objects include events and semaphores, but also processes, threads, and more.

As a result, anyone who needs to wait on a very large, or unbounded, number of waitable objects shards the wait requests into batches of 64, with one thread per batch. This is exactly how dotnet's PortableThreadPool is implemented today. It technically scales without limit, but one thread per 64 waits can add up to a lot of threads, and every single register/unregister operation has to wake up the WaiterThread so that it can update its wait list.
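To put rough numbers on the sharding described above, here is a minimal sketch in Python. It assumes exactly 64 waits per thread, as stated above; a real implementation may reserve a slot in each batch for its own wake-up signal, which would lower the usable capacity. The function names are hypothetical and do not come from the runtime.

```python
# Illustrative model only: names are hypothetical, not taken from the .NET runtime.
MAXIMUM_WAIT_OBJECTS = 64  # Windows limit per wait call, as quoted in the issue


def shard(handles, batch_size=MAXIMUM_WAIT_OBJECTS):
    """Split a list of waitable handles into batches, one batch per waiter thread."""
    return [handles[i:i + batch_size] for i in range(0, len(handles), batch_size)]


def waiter_threads_needed(wait_count, batch_size=MAXIMUM_WAIT_OBJECTS):
    """Number of dedicated waiter threads needed for wait_count registered waits."""
    return -(-wait_count // batch_size)  # ceiling division


# Example: 10,000 registered waits, with integers standing in for handles.
batches = shard(list(range(10_000)))
print(len(batches), waiter_threads_needed(10_000))
```

Under this assumption the thread count grows linearly with the number of registered waits, which is the cost this issue is about.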

One little-known gem of Windows 8 is that the kernel finally makes this truly scalable by adding the ability for registered waits to directly post a packet to a given IOCP, removing the need to have any thread at all. This is how CreateThreadpoolWait is implemented as of Win8+.

Applied to dotnet: because CreateThreadpoolWait implicitly binds the wait to the Win32 threadpool's own IOCP, you would still need a small hop from the Win32 threadpool callback back into the dotnet threadpool. That is still a clear improvement over having a custom pool of waiter threads that constantly need to be woken up to adjust their wait list.
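The shape of that hop can be modeled with two ordinary thread pools. This is only an analogy in Python, not Win32 or .NET code: `os-pool` stands in for the Win32 threadpool that receives the wait completion, `runtime-pool` stands in for the dotnet threadpool, and every name here is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_demo():
    """Model a wait callback that only forwards work to a second pool."""
    ran_on = []
    with ThreadPoolExecutor(1, thread_name_prefix="os-pool") as os_pool, \
         ThreadPoolExecutor(2, thread_name_prefix="runtime-pool") as runtime_pool:

        def user_callback():
            # The user's callback runs on the "runtime" pool, as callers expect.
            ran_on.append(threading.current_thread().name)

        def on_wait_signaled():
            # Runs on the "OS" pool: do no user work here, just hand off.
            ran_on.append(threading.current_thread().name)
            return runtime_pool.submit(user_callback)

        forwarded = os_pool.submit(on_wait_signaled).result()
        forwarded.result()  # wait for the forwarded user callback to finish
    return ran_on


hops = run_demo()
print(hops)
```

The point of the sketch is that the forwarding callback does nothing but enqueue, so the extra cost per signaled wait is one queue operation rather than a dedicated waiter thread.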

Considering AOT already has a formal implementation that uses CreateThreadpoolWait, it sounds like reusing that code in all cases for Windows should not increase the overall code complexity?

Sidenote1: Richter's book says CreateThreadpoolWait simply delegates to a pool of threads that each wait on 64 items. This was true when that book was published (Vista and Windows 7), but is no longer true today. Windows Internals, 7th edition, briefly confirms this improvement in the threadpool chapter.

Sidenote2: technically, the Win32 threadpool could be bypassed and registered waits could post directly to dotnet's own IOCP, making this even more efficient. However, the relevant NT APIs are undocumented and there is no Win32 API for this feature other than CreateThreadpoolWait, which forces a hop through the Win32 pool.

Sidenote3: I am not breaking news about internal APIs (NtAssociateWaitCompletionPacket and all) here - you will find similar interest in this feature online from the tokio folks, golang folks, etc. There have also been public requests to the OS team to document the few NT APIs involved. Those were denied, but I think the requests might not have reached the right ears. Could this use case justify asking the OS team to document them and/or add a Win32 API? Even if not, I would still guesstimate that the hop through the Win32 threadpool is better than the current custom implementation.

Sidenote4: even the legacy RegisterWaitForSingleObject benefits from this, but if you are going to wait multiple times in a row, CreateThreadpoolWait amortizes the setup cost.

Tagging @eduardo-vp since I saw that they made Win32 versus portable threadpool changes in recent history.

Thanks!

ghost commented 1 year ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

Issue Details
Author: fbrosseau
Assignees: -
Labels: `area-System.Threading`, `tenet-performance`, `untriaged`
Milestone: -