Closed benaadams closed 3 weeks ago
I think this should just use a regular lock, which has all sorts of smarts to deal with contention.
SpinLocks are OK for cases where contention is extremely rare. That is clearly not the case for this lock, as your trace demonstrated.
The comments suggest that this was a spin lock to support SQL hosting interfaces (regular locks could not be used in all situations with SQL hosting interfaces). We do not support SQL hosting interfaces in CoreCLR.
Maybe we should just port the managed corert code for the thread pool back to coreclr 😄
I would be totally up for porting over the corert thread pool back to coreclr! It's not like I'm biased or anything 😆
@jkoritzinsky I'm fully supportive of this 😄
In using the Windows thread pool, investigations should be done into the scheduling behavior between regular work items and I/O work items, to see whether a separate I/O thread pool would be necessary. For the Unix side, aside from fixes needed in that implementation, investigations should be done to see whether using GCD where available would provide similar benefits to the Windows thread pool. There are other known issues that should ideally be fixed along with this work. I've been trying to get that work scheduled, but it hasn't happened so far.
> using the Windows thread pool
> using GCD where available
Agree that there are a lot of ideas for how we can make the threadpool better. Another one is whether it would be beneficial to merge the regular and I/O threads in the built-in threadpool. All of these work items should be independent of the work item to take the managed thread pool rewrite in C#.
Agreed, we shouldn't do that as part of the managed port and I do think having a thread pool that doesn't depend on the OS is goodness in general (for portability). It would be interesting to test the windows thread pool in isolation for certain scenarios though (say our techempower plaintext) by replacing scheduler implementations in certain places (TaskScheduler/PipeScheduler etc). That could give us a read on whether those approaches would be better or not.
Unix also has its own I/O threads for Sockets: System/Net/Sockets/SocketAsyncEngine.Unix.cs#L291-L349
One of the issues with having a single threadpool is users sticking blocking items on the threadpool that are waiting for I/O to unblock them, swallowing up all the threads that would also be needed to run the items that would do the unblocking.
Secondary is the I/O event listeners, which are put on the I/O threadpool themselves, being blocking (`GetQueuedCompletionStatus` | `epoll_wait`), so each would eat a thread.
Alternative approaches are definitely interesting to investigate for the future, but likely more churn, with working out what to do with threadpool statics that are now on unmanaged threads etc. However, the CLR threadpool is a known quantity, and moving that implementation to managed code would remove a lot of the thunking that happens between managed and unmanaged code.
e.g.
Excluding the time spent in ancillary work around managing the ThreadPool threads' activation/sleep and coordinating work item counts:

Direct callstack descent to execute managed code (`Dispatch` loop):

```
Thread::intermediateThreadProc
ThreadpoolMgr::WorkerThreadStart
  100%  ManagedPerAppDomainTPCount::DispatchWorkItem • 50,156 ms • coreclr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem
  99.8% ManagedThreadBase_DispatchOuter • 50,060 ms • coreclr.dll!ManagedThreadBase_DispatchOuter
  99.7% ManagedThreadBase_DispatchMiddle • 49,993 ms • coreclr.dll!ManagedThreadBase_DispatchMiddle
  99.5% QueueUserWorkItemManagedCallback • 49,883 ms • coreclr.dll!QueueUserWorkItemManagedCallback
  98.3% MethodDescCallSite::CallTargetWorker • 49,312 ms • coreclr.dll!MethodDescCallSite::CallTargetWorker
  97.9% CallDescrWorkerInternal • 49,088 ms • coreclr.dll!CallDescrWorkerInternal
► 97.6% Dispatch • 48,968 ms • System.Threading.ThreadPoolWorkQueue.Dispatch()
```
Direct callstack descent to execute I/O callback managed code:

```
Thread::intermediateThreadProc
ThreadpoolMgr::CompletionPortThreadStart
  100%  BindIoCompletionCallbackStub • 853 ms • coreclr.dll!BindIoCompletionCallbackStub
  99.8% BindIoCompletionCallbackStubEx • 851 ms • coreclr.dll!BindIoCompletionCallbackStubEx
  99.3% ManagedThreadBase_FullTransitionWithAD • 847 ms • coreclr.dll!ManagedThreadBase_FullTransitionWithAD
  99.0% ManagedThreadBase_DispatchOuter • 845 ms • coreclr.dll!ManagedThreadBase_DispatchOuter
  98.7% ManagedThreadBase_DispatchMiddle • 842 ms • coreclr.dll!ManagedThreadBase_DispatchMiddle
  98.3% BindIoCompletionCallBack_Worker • 838 ms • coreclr.dll!BindIoCompletionCallBack_Worker
  97.3% DispatchCallSimple • 830 ms • coreclr.dll!DispatchCallSimple
  97.2% CallDescrWorkerInternal • 829 ms • coreclr.dll!CallDescrWorkerInternal
► 96.9% PerformIOCompletionCallback • 826 ms • System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, NativeOverlapped*)
```
> All these work items should be independent of the work item to take the managed thread pool rewrite in C#.
True; in any case the managed implementation would still be necessary as a fallback where the OS thread pool cannot be used, or for other purposes.
From the current platform pipeline TE benchmarks
Looking at the Windows trace (from ASP.NET Core TE Platform Json, non-pipelined) in https://github.com/dotnet/coreclr/issues/12118#issuecomment-441962737, I was wondering what the 53.7% KernelBase.dll is that's dominating the TP Dispatch path, as the sleeping would be in `CLRLifoSemaphore`, which is accounted for (the actual ThreadPool Dispatch is under the 8.76% `ManagedPerAppDomainTPCount::DispatchWorkItem` branch); so I was looking for something that wouldn't really show up as a function call.

While looking in `ThreadpoolMgr::WorkerThreadStart` I came across the use of a `DangerousNonHostedSpinLockHolder` macro:
https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/win32threadpool.cpp#L2158

Its `AcquireLock` is an `InterlockedExchange` wrapper:
https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/spinlock.h#L86-L90

around the `YIELD_WHILE` macro:
https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/spinlock.h#L54-L61

which is a busy loop around `SwitchToThread`, falling back to Sleep(1) after 32,768 calls (or 5,120 calls on ARM):
https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/hosting.cpp#L685-L695
I don't think this has been revisited in a while as the code hasn't changed since Core was open sourced and the comments refer to .NET Compact Framework 2.0 Service Pack 1 (NDP 2.0 SP1 fix) and "As of early 2011, ARM CPUs are much slower"
https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/hosting.cpp#L670-L676
Being only a busy loop around `SwitchToThread` will only give up time to a thread scheduled on the same processor, and not having pause instructions in the loop won't be nice to a thread on its hyper-threaded sibling core. Processors have changed quite a bit since then: it should at least be yielding the processor (pause) during the loop for HT machines, and probably calling Sleep(0) in case the lock is held by a thread on a different core, since many more cores are common now than in 2011; or an alternative signalling strategy should be used (like the `CLRLifoSemaphore`)?
/cc @kouvel @stephentoub @tarekgh @jkotas @vancem @davidfowl