dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

YIELD_WHILE/DangerousNonHostedSpinLockHolder should be modernized #11563

Closed benaadams closed 3 weeks ago

benaadams commented 5 years ago

Looking at the Windows trace (from ASP.NET Core TE Platform Json, non-pipelined) in https://github.com/dotnet/coreclr/issues/12118#issuecomment-441962737, I was wondering what the 53.7% of KernelBase.dll is that's dominating the TP Dispatch path. The sleeping happens in CLRLifoSemaphore, which is already accounted for (the actual ThreadPool Dispatch is under the 8.76% ManagedPerAppDomainTPCount::DispatchWorkItem branch), so I was looking for something that wouldn't really show up as a function call:

100%   ThreadpoolMgr::WorkerThreadStart  •  193,478 ms  •  coreclr.dll!ThreadpoolMgr::WorkerThreadStart
► 53.7%   KernelBase.dll  •  103,971 ms   <---- what's this?
► 22.3%   CLRLifoSemaphore::Wait  •  43,085 ms  •  coreclr.dll!CLRLifoSemaphore::Wait
  8.76%   ManagedPerAppDomainTPCount::DispatchWorkItem  •  16,951 ms  •  coreclr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem
► 7.52%   CLRLifoSemaphore::WaitForSignal  •  14,550 ms  •  coreclr.dll!CLRLifoSemaphore::WaitForSignal
► 6.19%   [Unknown]  •  11,977 ms
► 0.74%   DestroyThread  •  1,433 ms  •  coreclr.dll!DestroyThread

While looking in ThreadpoolMgr::WorkerThreadStart I came across the use of the DangerousNonHostedSpinLockHolder macro https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/win32threadpool.cpp#L2158

Its AcquireLock is an InterlockedExchange wrapper https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/spinlock.h#L86-L90

around the YIELD_WHILE macro

https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/spinlock.h#L54-L61

which is a busy loop around SwitchToThread that falls back to Sleep(1) after 32,768 calls (or 5,120 calls on ARM)

https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/hosting.cpp#L685-L695
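Paraphrasing the linked sources, the acquire path boils down to roughly this shape (a sketch, not a verbatim copy; names and constants follow the links above):

```cpp
#include <windows.h>

// AcquireLock is an InterlockedExchange loop; the YIELD_WHILE body yields the
// rest of the quantum between attempts and, after enough SwitchToThread calls,
// falls back to Sleep(1).
static void AcquireLock(volatile LONG* lock)
{
    unsigned int switchCount = 0;
    const unsigned int sleepThreshold = 32768; // 5,120 on ARM per the linked comment

    // InterlockedExchange returns the previous value; 0 means we took the lock.
    while (InterlockedExchange(lock, 1) != 0)
    {
        if (++switchCount >= sleepThreshold)
            Sleep(1);           // parks the thread for a timer tick (~1ms or more)
        else
            SwitchToThread();   // only yields to a ready thread on the same core
    }
}
```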

I don't think this has been revisited in a while, as the code hasn't changed since Core was open sourced and the comments still refer to .NET Compact Framework 2.0 Service Pack 1 (the NDP 2.0 SP1 fix) and to "As of early 2011, ARM CPUs are much slower"

https://github.com/dotnet/coreclr/blob/8aa0869eb9153429091fdba49469d89ec33092cb/src/vm/hosting.cpp#L670-L676

Being only a busy loop around SwitchToThread, it will only give up time to a thread scheduled on the same core; and having no yield instructions in the loop won't be nice for a thread on its hyper-threaded sibling core?

Processors have changed quite a bit since then. The loop should at least be yielding the processor (pause) for HT machines, and probably calling Sleep(0) in case the lock is held by a thread on a different core, since many more cores are common now than in 2011; or an alternative signalling strategy should be used (like the CLRLifoSemaphore)?
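Roughly the shape I have in mind (just a sketch; the tier thresholds are made-up numbers, not tuned values):

```cpp
#include <windows.h>

// Hypothetical modernized wait loop: pause the core first so the hyper-threaded
// sibling gets the execution resources, then yield to same-core threads, then
// let threads on other cores run, and only then park the thread.
static void ModernizedWait(volatile LONG* lock)
{
    unsigned int iteration = 0;
    while (*lock != 0)
    {
        ++iteration;
        if (iteration < 64)
            YieldProcessor();   // emits pause/yield; cheap, HT-friendly
        else if (iteration < 1024)
            SwitchToThread();   // yields to a ready thread on this core
        else if (iteration < 32768)
            Sleep(0);           // lets equal-priority threads on any core run
        else
            Sleep(1);           // last resort: give up the CPU for a timer tick
    }
}
```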

/cc @kouvel @stephentoub @tarekgh @jkotas @vancem @davidfowl

jkotas commented 5 years ago

I think this should just use a regular lock that has all sorts of smarts to deal with contention.

SpinLocks are ok for cases where contention is extremely rare. That is clearly not the case for this lock, as your trace demonstrates.
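For illustration, on Windows even a plain CRITICAL_SECTION has that spin-then-wait behavior built in (a minimal sketch; g_workerListLock is a hypothetical name, and CoreCLR would presumably use its own lock abstraction rather than raw Win32):

```cpp
#include <windows.h>

CRITICAL_SECTION g_workerListLock; // hypothetical stand-in for the lock in question

void InitLock()
{
    // Spin briefly under contention, then fall back to an efficient kernel wait,
    // instead of burning SwitchToThread/Sleep(1) cycles forever.
    InitializeCriticalSectionAndSpinCount(&g_workerListLock, 4000);
}

void DoGuardedWork()
{
    EnterCriticalSection(&g_workerListLock);
    // ... touch the shared thread pool bookkeeping ...
    LeaveCriticalSection(&g_workerListLock);
}
```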

jkotas commented 5 years ago

The comments suggest that this was a spin lock to support SQL hosting interfaces (regular locks could not be used in all situations with SQL hosting interfaces). We do not support SQL hosting interfaces in CoreCLR.

davidfowl commented 5 years ago

Maybe we should just port the managed corert code for the thread pool back to coreclr 😄

jkoritzinsky commented 5 years ago

I would be totally up for porting over the corert thread pool back to coreclr! It's not like I'm biased or anything 😆

davidfowl commented 5 years ago

@jkoritzinsky I'm fully supportive of this 😄

kouvel commented 5 years ago

If using the Windows thread pool, investigations should be done into the scheduling behavior between regular work items and I/O work items, to see whether a separate I/O thread pool would be necessary. On the Unix side, aside from fixes needed in that implementation, investigations should be done to see whether using GCD where available would provide benefits similar to the Windows thread pool. There are other known issues that should ideally be fixed along with this work. I've been trying to get that work scheduled but it hasn't happened so far.

jkotas commented 5 years ago

> using the Windows thread pool
> using GCD where available

Agree that there are a lot of ideas for how we can make the threadpool better. Another one is whether it would be beneficial to merge the regular and I/O threads in the built-in threadpool. All these work items should be independent of the work item to take the managed thread pool rewrite in C#.

davidfowl commented 5 years ago

Agreed, we shouldn't do that as part of the managed port, and I do think having a thread pool that doesn't depend on the OS is goodness in general (for portability). It would be interesting to test the Windows thread pool in isolation for certain scenarios though (say our TechEmpower plaintext) by replacing scheduler implementations in certain places (TaskScheduler/PipeScheduler etc.). That could give us a read on whether those approaches would be better or not.

benaadams commented 5 years ago

Unix also has its own I/O threads for Sockets System/Net/Sockets/SocketAsyncEngine.Unix.cs#L291-L349

One of the issues with having a single threadpool is users sticking blocking items on the threadpool that are waiting on I/O to unblock them, swallowing up all the threads that would also be needed to run the items that would do the unblocking.

A secondary issue is the I/O event listeners: if put on the threadpool themselves, they are blocking (GetQueuedCompletionStatus | epoll_wait), so each would eat a thread.
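A contrived Win32 sketch of that starvation, using a deliberately capped thread pool (the cap of 2 and all names here are made up for the demo):

```cpp
#include <windows.h>
#include <stdio.h>

static HANDLE g_event;

// Blocking work item: waits for a signal that only another work item can send.
static VOID CALLBACK BlockedWork(PTP_CALLBACK_INSTANCE, PVOID, PTP_WORK)
{
    WaitForSingleObject(g_event, INFINITE);
}

// The item that would unblock everyone -- but it never gets a thread.
static VOID CALLBACK UnblockWork(PTP_CALLBACK_INSTANCE, PVOID, PTP_WORK)
{
    SetEvent(g_event);
}

int main()
{
    g_event = CreateEventW(nullptr, TRUE, FALSE, nullptr); // manual-reset

    PTP_POOL pool = CreateThreadpool(nullptr);
    SetThreadpoolThreadMaximum(pool, 2); // cap the pool at 2 threads

    TP_CALLBACK_ENVIRON env;
    InitializeThreadpoolEnvironment(&env);
    SetThreadpoolCallbackPool(&env, pool);

    PTP_WORK blocked = CreateThreadpoolWork(BlockedWork, nullptr, &env);
    SubmitThreadpoolWork(blocked); // occupies thread 1
    SubmitThreadpoolWork(blocked); // occupies thread 2

    PTP_WORK unblock = CreateThreadpoolWork(UnblockWork, nullptr, &env);
    SubmitThreadpoolWork(unblock); // queued, but no thread is left to run it

    Sleep(2000);
    printf("pool starved: the unblocking item never ran\n");
    return 0; // process exit tears down the still-blocked callbacks
}
```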

Alternative approaches are definitely interesting to investigate for the future, but they likely mean more churn (working out what to do with threadpool statics that would now be on unmanaged threads, etc.). The CLR threadpool, however, is a known quantity, and moving that implementation to managed code would remove a lot of the thunking that happens between managed and unmanaged code.

e.g.

Excluding the time spent in ancillary work around managing the ThreadPool threads activation/sleep and coordinating work item counts:

Direct callstack descent to execute managed code Dispatch loop

Thread::intermediateThreadProc 
  ThreadpoolMgr::WorkerThreadStart
    100%   ManagedPerAppDomainTPCount::DispatchWorkItem  •  50,156 ms  •  coreclr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem
      99.8%   ManagedThreadBase_DispatchOuter  •  50,060 ms  •  coreclr.dll!ManagedThreadBase_DispatchOuter
        99.7%   ManagedThreadBase_DispatchMiddle  •  49,993 ms  •  coreclr.dll!ManagedThreadBase_DispatchMiddle
          99.5%   QueueUserWorkItemManagedCallback  •  49,883 ms  •  coreclr.dll!QueueUserWorkItemManagedCallback
            98.3%   MethodDescCallSite::CallTargetWorker  •  49,312 ms  •  coreclr.dll!MethodDescCallSite::CallTargetWorker
              97.9%   CallDescrWorkerInternal  •  49,088 ms  •  coreclr.dll!CallDescrWorkerInternal
              ► 97.6%   Dispatch  •  48,968 ms  •  System.Threading.ThreadPoolWorkQueue.Dispatch()

Direct callstack descent to execute IO Callback managed code

Thread::intermediateThreadProc 
  ThreadpoolMgr::WorkerThreadStart
    ThreadpoolMgr::CompletionPortThreadStart 
      100%   BindIoCompletionCallbackStub  •  853 ms  •  coreclr.dll!BindIoCompletionCallbackStub
        99.8%   BindIoCompletionCallbackStubEx  •  851 ms  •  coreclr.dll!BindIoCompletionCallbackStubEx
          99.3%   ManagedThreadBase_FullTransitionWithAD  •  847 ms  •  coreclr.dll!ManagedThreadBase_FullTransitionWithAD
            99.0%   ManagedThreadBase_DispatchOuter  •  845 ms  •  coreclr.dll!ManagedThreadBase_DispatchOuter
              98.7%   ManagedThreadBase_DispatchMiddle  •  842 ms  •  coreclr.dll!ManagedThreadBase_DispatchMiddle
                98.3%   BindIoCompletionCallBack_Worker  •  838 ms  •  coreclr.dll!BindIoCompletionCallBack_Worker
                  97.3%   DispatchCallSimple  •  830 ms  •  coreclr.dll!DispatchCallSimple
                    97.2%   CallDescrWorkerInternal  •  829 ms  •  coreclr.dll!CallDescrWorkerInternal
                    ► 96.9%   PerformIOCompletionCallback  •  826 ms  •  System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, NativeOverlapped*)

kouvel commented 5 years ago

> All these work items should be independent of the work item to take the managed thread pool rewrite in C#.

True; in any case the managed implementation would still be necessary as a fallback where the OS thread pool cannot be used, or for other purposes.

benaadams commented 5 years ago

From the current platform pipeline TE benchmarks

[two benchmark images attached]

dotnet-policy-service[bot] commented 1 month ago

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

dotnet-policy-service[bot] commented 3 weeks ago

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.