Was seeing something similar here: https://github.com/dotnet/coreclr/issues/6132
Also have some threadpool tweaks that may or may not help here: https://github.com/dotnet/coreclr/pull/5943
Would be interested in how they work on a 48-core machine, if you have the time?
Sure, I can give it a shot when I have a chance.
This is what I see on Windows with VTune; there's a lot of contention in WorkerThreadStart.
That looks like ThreadpoolMgr::WorkerThreadStart line 2313, which is:
```cpp
counts = WorkerCounter.GetCleanCounts();
```
which leads to line 380 in win32threadpool.h:
```cpp
result.AsLongLong = FastInterlockCompareExchangeLong(&counts.AsLongLong, 0, 0);
```
/cc @kouvel @jkotas
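For context, an illustrative sketch (not the actual CoreCLR source) of why GetCleanCounts is so hot: it takes its 64-bit snapshot with a no-op interlocked compare-exchange, so every worker thread doing a mere read still performs a full read-modify-write on the same cache line.
```cpp
// Illustrative sketch, not the actual CoreCLR code: taking a "clean"
// snapshot via CAS(&x, 0, 0). The CAS acquires the cache line in
// exclusive state, so N threads calling this concurrently serialize on
// one line -- the contention seen in WorkerThreadStart.
#include <atomic>

struct ThreadCounter
{
    std::atomic<long long> asLongLong{0};

    long long GetCleanCounts()
    {
        long long expected = 0;
        // compare_exchange with (0, 0): writes nothing useful, but still
        // performs the full atomic read-modify-write handshake.
        asLongLong.compare_exchange_strong(expected, 0);
        return expected; // holds the observed value either way
    }
};
```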
Getting the counts is the same issue in ThreadpoolMgr::MaybeAddWorkingWorker line 1273, which is:
```cpp
ThreadCounter::Counts counts = WorkerCounter.GetCleanCounts();
```
For MaybeAddWorkingWorker, 93% of the back-end bound time is L1 bound: Loads Blocked by Store Forwarding.
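An illustrative sketch (not the actual CoreCLR code) of the pattern that triggers that stall: narrow stores to fields of a union immediately followed by a full-width load of the same memory.
```cpp
// Illustrative sketch: the CPU cannot forward a 16-bit store into a
// 64-bit load, so the wide load waits for the store to drain to L1 --
// "Loads Blocked by Store Forwarding".
union Counts
{
    struct Fields
    {
        short NumActive;
        short NumWorking;
        short NumRetired;
        short MaxWorking;
    } fields;               // written with 16-bit stores...
    long long AsLongLong;   // ...then read back as one 64-bit load
};

long long Example(Counts& c)
{
    c.fields.NumWorking += 1; // narrow store
    return c.AsLongLong;      // wide load blocked by store forwarding
}
```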
Resolved by https://github.com/dotnet/coreclr/pull/6516
I'm not seeing much difference with either dotnet/coreclr#6516 or dotnet/coreclr#5943 on Linux on the 48-core machine. Nothing is jumping out in the profiles. I'll take a look at Windows next.
@kouvel I just tested this with dotnet/coreclr#6516 and it didn't make any difference.
Was auto-closed because the PR said "resolved" in the body (I've been testing on Windows).
Ben, since VTune is available for Linux (https://software.intel.com/en-us/intel-vtune-amplifier-xe), can we profile before and after the dotnet/coreclr#6516 merge? We may also be able to spot the culprit that way.
@jasonwilliams200OK the issue I was having on Windows was in the same functions (WorkerThreadStart & MaybeAddWorkingWorker); I don't have a Linux setup to look at.
I assume Linux would respond to _WIN64 and DECLSPEC_ALIGN(64), but it might not; it also might be a different issue.
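For reference, a sketch of the cache-line isolation the macro is meant to provide, assuming a 64-byte line (how the PAL maps the macro on Linux is the open question above):
```cpp
// On Windows DECLSPEC_ALIGN(x) expands to __declspec(align(x));
// standard C++11 alignas is a portable equivalent. Shown here as an
// illustration, not the actual CoreCLR declaration.
#include <atomic>

struct alignas(64) PaddedCounter
{
    std::atomic<long long> value{0}; // sole occupant of its cache line,
                                     // so CAS traffic on it cannot
                                     // false-share with neighbours
};

static_assert(alignof(PaddedCounter) == 64, "expect 64-byte alignment");
```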
The macros seem to be working on Linux
Hopefully dotnet/coreclr#6516 addressed the high time in WorkerThreadStart & MaybeAddWorkingWorker on Linux (as well as Windows); however, there may be a secondary issue causing the non-100% CPU utilisation observed.
@benaadams Testing on arm64 showed ThreadpoolMgr::MinLimitTotalWorkerThreads was set too high. On an 8-core Skylake processor, performance peaked when it was set to 4 threads. I saw a similar result on an arm64 48-core machine.
I would recommend changing
```cpp
MinLimitTotalWorkerThreads = forceMin > 0 ? (LONG)forceMin : (LONG)NumberOfProcessors;
```
to something like
```cpp
MinLimitTotalWorkerThreads = forceMin > 0 ? (LONG)forceMin : min((LONG)NumberOfProcessors, 4);
```
4 threads on a 48-core machine seems like very poor utilization?
@benaadams Poor utilization cuts both ways. If there are typically sufficient work items to keep 48 threads busy, then 4 threads would be poor utilization. However, as is typical, an application is often not highly parallelizable/scalable and has only a few work items; in that case 48 threads would be poor utilization. It either creates too many spinners, or waiters that are too cold (due to LRU wake).
I will continue to try to scale the threadpool design to efficiently use 48 cores...
Also note that in many cases the threadpool will not be the only source of additional threads: libuv spawns threads to service sockets, gcServer=1 runs one thread per CPU to do GC, and there may be other ThreadPool instances. It seems naive to assume the ThreadPool is optimal when there are at least NumberOfProcessors threads in the pool.
Right now MinLimitTotalWorkerThreads is serving two purposes: it sets the initial thread count for the hill climbing algorithm (which is poorly documented), and it also sets the minimum thread count for the hill climbing algorithm. While there is a thread cost function in the hill climbing algorithm, it does not allow the algorithm to reduce the thread count below MinLimitTotalWorkerThreads.
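A minimal sketch of that clamping behaviour (hypothetical names; this is not the actual hill climbing code):
```cpp
// Hypothetical sketch of how a minimum-thread floor clamps the output of
// a hill climbing step. Whatever the cost function suggests, the result
// can never drop below the floor, so the floor disables any downward
// exploration beneath it.
int ApplyThreadCountLimits(int suggestedByHillClimbing,
                           int minLimitTotalWorkerThreads,
                           int maxLimitTotalWorkerThreads)
{
    if (suggestedByHillClimbing < minLimitTotalWorkerThreads)
        return minLimitTotalWorkerThreads;
    if (suggestedByHillClimbing > maxLimitTotalWorkerThreads)
        return maxLimitTotalWorkerThreads;
    return suggestedByHillClimbing;
}
```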
@benaadams @stephentoub One of the hot spots I see on the 48-core machine is Dequeue() in ThreadPool.cs. The issue is that it must iterate over all other WorkStealingQueues when there is no work. It looks to me like it is not really scalable, arguably O(N^2): every CPU has to poll every queue looking for work. Seems like disabling/redesigning WorkStealingQueue would be in order. Especially since it is only currently used by UnsafeQueueCustomWorkItem(forceGlobal: false).
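For illustration, a minimal C++ sketch of the scan pattern being described (types and names are hypothetical; the real code is the C# WorkStealingQueue machinery in ThreadPool.cs):
```cpp
// Hypothetical sketch of the steal-scan pattern. With N workers and
// little queued work, every idle worker probes all of the other queues:
// O(N) probes per worker, and O(N^2) polling across the pool.
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct WorkItem { void (*run)(); };

struct WorkStealingQueue
{
    std::mutex lock;
    std::deque<WorkItem> items;

    bool TryPop(WorkItem& item)
    {
        std::lock_guard<std::mutex> g(lock);
        if (items.empty()) return false;
        item = items.back();
        items.pop_back();
        return true;
    }
};

// Called by a worker whose local queue (and the global queue) are empty.
bool TrySteal(std::vector<WorkStealingQueue*>& queues, size_t self, WorkItem& item)
{
    for (size_t i = 0; i < queues.size(); ++i) // poll everyone else's queue
    {
        if (i == self) continue;
        if (queues[i]->TryPop(item)) return true;
    }
    return false; // a fruitless full scan: the contended case on 48 cores
}
```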
> Especially since it is only currently used by UnsafeQueueCustomWorkItem(forceGlobal: false).
Which is used by Tasks by default any time they're queued.
> the issue is that it must iterate over all other WorkStealingQueues when there is no work
Of course, at that point it's because there's limited work available (unless you have a very unbalanced workload, e.g. where one pool thread is a pure producer and another is a pure consumer).
> Seems like disabling/redesigning WorkStealingQueue would be in order.
Some discussion here: https://github.com/dotnet/coreclr/issues/5930
> Of course at that point it's because there's limited work available
Which was why I suggested MinLimitTotalWorkerThreads should not default to NumberOfProcessors, but should be allowed to be something smaller, at least until the O(N^2) issue is resolved. There should be an N^2 term for thread cost in the HillClimbing algorithm.
The problem is pretty extreme: with COMPlus_ThreadPool_ForceMinWorkerThreads=5 I see 20% better throughput with 40% less CPU usage. And there is nothing preventing the hill climbing algorithm from scaling the number of threads appropriately.
I will continue to investigate this. I still think UnfairSemaphore may also be responsible for some of the issue.
I also noticed that some types of work items run more efficiently on fewer cores when hyperthreading is involved. A problem with lowering the min threads (or allowing hill climbing to reduce the number of threads below the proc count) is that for scenarios that run optimally when the number of threads equals the number of procs, hill climbing would frequently reduce the number of threads below the proc count to test it and slow down that scenario. That change may need some tweaking to hill climbing, maybe something like decreasing the frequency and/or magnitude of thread count decreases when the thread count is below the proc count, but to still allow it. There will probably be some tradeoff, but I think it could be minimized.
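For example, a hypothetical damping of downward moves along the lines suggested above (the names and the factor are illustrative, not from CoreCLR):
```cpp
// Hypothetical sketch: shrink thread-count decreases once the count is
// at or below the processor count, so hill climbing can still explore
// below it without frequently hurting proc-count-optimal workloads.
double DampedMove(double proposedMove, int currentThreadCount, int processorCount)
{
    if (proposedMove < 0 && currentThreadCount <= processorCount)
        return proposedMove * 0.5; // halve the magnitude of decreases
    return proposedMove;
}
```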
@sdmaclea Is most contention in the scenario where there aren't enough work items for the threads; so most of the time is spent in a fruitless search?
I'm wondering if something better can be done with the capped count maintained by EnsureThreadRequested and MarkThreadRequestSatisfied: have it track the total number of items queued rather than being capped at the proc count, so that if there aren't any queued items a thread could pick up, it would go back to sleep rather than searching.
@stephentoub @kouvel I'm away from a computer that can compile any code, and this is likely very racy, but would something along these principles help reduce contention (i.e. the searching) when the number of queued items is low? https://github.com/benaadams/coreclr/commit/1912680e27cb4f21a69c751020e31867a25622fd
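A minimal sketch of that principle (hypothetical names, and it deliberately ignores the races the real code would have to handle): only request as many workers as there are queued items, so surplus workers are never woken into a fruitless search.
```cpp
// Hypothetical sketch of capping outstanding thread requests at the
// number of queued work items instead of at the processor count.
#include <atomic>

void WakeOneWorker(); // assumed: releases one waiting worker (e.g. a semaphore)

std::atomic<int> g_queuedItems{0};
std::atomic<int> g_outstandingRequests{0};

void OnWorkItemEnqueued()
{
    int queued = g_queuedItems.fetch_add(1) + 1;
    int requests = g_outstandingRequests.load();
    // Only ask for another worker if fewer workers have been requested
    // than there are items for them to find.
    while (requests < queued &&
           !g_outstandingRequests.compare_exchange_weak(requests, requests + 1))
    {
        // compare_exchange_weak reloads 'requests' on failure
    }
    if (requests < queued)
        WakeOneWorker(); // our CAS won, so a wake-up is owed
}

void OnWorkItemDequeued()
{
    g_queuedItems.fetch_sub(1);
    g_outstandingRequests.fetch_sub(1); // this request has been satisfied
}
```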
dotnet/coreclr#13007 is the fix I drafted for this.
@benaadams
> @sdmaclea Is most contention in the scenario where there aren't enough work items for the threads; so most of the time is spent in a fruitless search?
Yes. dotnet/coreclr#13274 is my simplistic answer for this. I started with something like your gist above, but evolved it to this as it was lighter weight. If we were willing to count queued items, we could do better.
I'm assuming @sdmaclea's fix has solved this; closing in favor of https://github.com/dotnet/coreclr/issues/14017 to complete the work.
On our 48-core machine we are not able to push above roughly 90% CPU utilization in some simple ASP.NET Core scenarios (TechEmpower plaintext). We saturate the CPU, with correspondingly better throughput, on Windows. During the investigation we've had a few theories but no concrete answers yet: