Was seeing something similar here: https://github.com/dotnet/coreclr/issues/6132
Also have some threadpool tweaks that may or may not help here: https://github.com/dotnet/coreclr/pull/5943
Would be interested in how they work on a 48-core machine, if you have the time?
Sure, I can give it a shot when I have a chance.
This is what I see on Windows with VTune; there's a lot of contention in WorkerThreadStart.
That looks like ThreadpoolMgr::WorkerThreadStart line 2313, which is:
```cpp
counts = WorkerCounter.GetCleanCounts();
```
which leads to line 380 in win32threadpool.h:
```cpp
result.AsLongLong = FastInterlockCompareExchangeLong(&counts.AsLongLong, 0, 0);
```
/cc @kouvel @jkotas
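For context, an illustrative sketch (not the actual CoreCLR source) of why GetCleanCounts is so hot: it takes its 64-bit snapshot with a no-op interlocked compare-exchange, so every worker thread doing a mere read still performs a full read-modify-write on the same cache line.
```cpp
// Illustrative sketch, not the actual CoreCLR code: taking a "clean"
// snapshot via CAS(&x, 0, 0). The CAS acquires the cache line in
// exclusive state, so N threads calling this concurrently serialize on
// one line -- the contention seen in WorkerThreadStart.
#include <atomic>

struct ThreadCounter
{
    std::atomic<long long> asLongLong{0};

    long long GetCleanCounts()
    {
        long long expected = 0;
        // compare_exchange with (0, 0): writes nothing useful, but still
        // performs the full atomic read-modify-write handshake.
        asLongLong.compare_exchange_strong(expected, 0);
        return expected; // holds the observed value either way
    }
};
```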
Getting the counts is the same issue in ThreadpoolMgr::MaybeAddWorkingWorker line 1273, which is:
```cpp
ThreadCounter::Counts counts = WorkerCounter.GetCleanCounts();
```
For MaybeAddWorkingWorker, 93% of the back-end bound time is L1 bound: Loads Blocked by Store Forwarding.
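An illustrative sketch (not the actual CoreCLR code) of the pattern that triggers that stall: narrow stores to fields of a union immediately followed by a full-width load of the same memory.
```cpp
// Illustrative sketch: the CPU cannot forward a 16-bit store into a
// 64-bit load, so the wide load waits for the store to drain to L1 --
// "Loads Blocked by Store Forwarding".
union Counts
{
    struct Fields
    {
        short NumActive;
        short NumWorking;
        short NumRetired;
        short MaxWorking;
    } fields;               // written with 16-bit stores...
    long long AsLongLong;   // ...then read back as one 64-bit load
};

long long Example(Counts& c)
{
    c.fields.NumWorking += 1; // narrow store
    return c.AsLongLong;      // wide load blocked by store forwarding
}
```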
Resolved by https://github.com/dotnet/coreclr/pull/6516
I'm not seeing much difference with either dotnet/coreclr#6516 or dotnet/coreclr#5943 on Linux on the 48-core machine. Nothing is jumping out in the profiles. I'll take a look at Windows next.
@kouvel I just tested this with dotnet/coreclr#6516 and it didn't make any difference.
Was auto-closed because the PR said "resolved" in the body (I've been testing on Windows).
Ben, since VTune is available for Linux (https://software.intel.com/en-us/intel-vtune-amplifier-xe), can we profile before and after the dotnet/coreclr#6516 merge? We may also be able to spot the culprit that way.
@jasonwilliams200OK the issue I was having on Windows was in the same functions (WorkerThreadStart & MaybeAddWorkingWorker); I don't have a Linux setup to look at.
I assume Linux would respond to _WIN64 and DECLSPEC_ALIGN(64), but it might not; it also might be a different issue.
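For reference, a sketch of the cache-line isolation the macro is meant to provide, assuming a 64-byte line (how the PAL maps the macro on Linux is the open question above):
```cpp
// On Windows DECLSPEC_ALIGN(x) expands to __declspec(align(x));
// standard C++11 alignas is a portable equivalent. Shown here as an
// illustration, not the actual CoreCLR declaration.
#include <atomic>

struct alignas(64) PaddedCounter
{
    std::atomic<long long> value{0}; // sole occupant of its cache line,
                                     // so CAS traffic on it cannot
                                     // false-share with neighbours
};

static_assert(alignof(PaddedCounter) == 64, "expect 64-byte alignment");
```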
The macros seem to be working on Linux
Hopefully dotnet/coreclr#6516 addressed the high time in WorkerThreadStart & MaybeAddWorkingWorker on Linux (as well as Windows); however, there may be a secondary issue causing the non-100% CPU utilisation observed.
@benaadams Testing on arm64 showed ThreadpoolMgr::MinLimitTotalWorkerThreads was set too high. On an 8-core Skylake processor, performance peaked when it was set to 4 threads. I saw a similar result on an arm64 48-core machine.
I would recommend changing
```cpp
MinLimitTotalWorkerThreads = forceMin > 0 ? (LONG)forceMin : (LONG)NumberOfProcessors;
```
to something like
```cpp
MinLimitTotalWorkerThreads = forceMin > 0 ? (LONG)forceMin : min((LONG)NumberOfProcessors, 4);
```
4 threads on a 48-core machine seems like very poor utilization?
@benaadams Poor utilization cuts both ways. If there are typically sufficient work items to keep 48 threads busy, then 4 threads would be poor utilization. However, as is typical, an application is often not highly parallelizable/scalable and has only a few work items; in that case 48 threads would be poor utilization. It either creates too many spinners, or waiters that are too cold (due to LRU wake).
I will continue to try to scale the threadpool design to efficiently use 48 cores...
Also note that in many cases the threadpool will not be the only source of additional threads: libuv spawns threads to service sockets, gcServer=1 runs one thread per CPU to do GC, and there may be other ThreadPool instances. It seems naive to assume the ThreadPool is optimal when there are at least NumberOfProcessors threads in the pool.
Right now MinLimitTotalWorkerThreads is serving two purposes: it sets the initial thread count for the hill climbing algorithm (which is poorly documented), and it also sets the minimum thread count for the hill climbing algorithm. While there is a thread cost function in the hill climbing algorithm, it does not allow the algorithm to reduce the thread count below MinLimitTotalWorkerThreads.
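A minimal sketch of that clamping behaviour (hypothetical names; this is not the actual hill climbing code):
```cpp
// Hypothetical sketch of how a minimum-thread floor clamps the output of
// a hill climbing step. Whatever the cost function suggests, the result
// can never drop below the floor, so the floor disables any downward
// exploration beneath it.
int ApplyThreadCountLimits(int suggestedByHillClimbing,
                           int minLimitTotalWorkerThreads,
                           int maxLimitTotalWorkerThreads)
{
    if (suggestedByHillClimbing < minLimitTotalWorkerThreads)
        return minLimitTotalWorkerThreads;
    if (suggestedByHillClimbing > maxLimitTotalWorkerThreads)
        return maxLimitTotalWorkerThreads;
    return suggestedByHillClimbing;
}
```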
@benaadams @stephentoub One of the hot spots I see on the 48-core machine is Dequeue() in ThreadPool.cs. The issue is that it must iterate over all other WorkStealingQueues when there is no work. It looks to me like it is not really scalable, arguably O(N^2): every CPU has to poll every queue looking for work. Seems like disabling/redesigning WorkStealingQueue would be in order. Especially since it is only currently used by UnsafeQueueCustomWorkItem(forceGlobal: false).
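For illustration, a minimal C++ sketch of the scan pattern being described (types and names are hypothetical; the real code is the C# WorkStealingQueue machinery in ThreadPool.cs):
```cpp
// Hypothetical sketch of the steal-scan pattern. With N workers and
// little queued work, every idle worker probes all of the other queues:
// O(N) probes per worker, and O(N^2) polling across the pool.
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct WorkItem { void (*run)(); };

struct WorkStealingQueue
{
    std::mutex lock;
    std::deque<WorkItem> items;

    bool TryPop(WorkItem& item)
    {
        std::lock_guard<std::mutex> g(lock);
        if (items.empty()) return false;
        item = items.back();
        items.pop_back();
        return true;
    }
};

// Called by a worker whose local queue (and the global queue) are empty.
bool TrySteal(std::vector<WorkStealingQueue*>& queues, size_t self, WorkItem& item)
{
    for (size_t i = 0; i < queues.size(); ++i) // poll everyone else's queue
    {
        if (i == self) continue;
        if (queues[i]->TryPop(item)) return true;
    }
    return false; // a fruitless full scan: the contended case on 48 cores
}
```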
> Especially since it is only currently used by UnsafeQueueCustomWorkItem(forceGlobal: false).
Which is used by Tasks by default any time they're queued.
> the issue is that it must iterate over all other WorkStealingQueues when there is no work
Of course, at that point it's because there's limited work available (unless you have a very unbalanced workload, e.g. where one pool thread is a pure producer and another is a pure consumer).
> Seems like disabling/redesigning WorkStealingQueue would be in order.
Some discussion here: https://github.com/dotnet/coreclr/issues/5930
> Of course at that point it's because there's limited work available
Which was why I suggested MinLimitTotalWorkerThreads should not default to NumberOfProcessors, but should be allowed to be something smaller, at least until the O(N^2) issue is resolved. There should be an N^2 term for thread cost in the HillClimbing algorithm.
The problem is pretty extreme: with COMPlus_ThreadPool_ForceMinWorkerThreads=5 I see 20% better throughput with 40% less CPU usage. And there is nothing preventing the hill climbing algorithm from scaling the number of threads appropriately.
I will continue to investigate this. I still think UnfairSemaphore may also be responsible for some of the issue.
I also noticed that some types of work items run more efficiently on fewer cores when hyperthreading is involved. A problem with lowering the min threads (or allowing hill climbing to reduce the number of threads below the proc count) is that for scenarios that run optimally when the number of threads equals the number of procs, hill climbing would frequently reduce the number of threads below the proc count to test it and slow down that scenario. That change may need some tweaking to hill climbing, maybe something like decreasing the frequency and/or magnitude of thread count decreases when the thread count is below the proc count, but to still allow it. There will probably be some tradeoff, but I think it could be minimized.
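For example, a hypothetical damping of downward moves along the lines suggested above (the names and the factor are illustrative, not from CoreCLR):
```cpp
// Hypothetical sketch: shrink thread-count decreases once the count is
// at or below the processor count, so hill climbing can still explore
// below it without frequently hurting proc-count-optimal workloads.
double DampedMove(double proposedMove, int currentThreadCount, int processorCount)
{
    if (proposedMove < 0 && currentThreadCount <= processorCount)
        return proposedMove * 0.5; // halve the magnitude of decreases
    return proposedMove;
}
```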
@sdmaclea Is most contention in the scenario where there aren't enough work items for the threads; so most of the time is spent in a fruitless search?
I'm wondering if something better can be done with the capped count maintained by EnsureThreadRequested and MarkThreadRequestSatisfied: have it track the total number of items queued rather than being capped at the proc count, so that if there aren't any queued items a thread could pick up, it would go back to sleep rather than searching.
@stephentoub @kouvel I'm away from a computer that can compile any code, and this is likely very racy, but would something along these principles help reduce contention (i.e. the searching) when the number of queued items is low? https://github.com/benaadams/coreclr/commit/1912680e27cb4f21a69c751020e31867a25622fd
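A minimal sketch of that principle (hypothetical names, and it deliberately ignores the races the real code would have to handle): only request as many workers as there are queued items, so surplus workers are never woken into a fruitless search.
```cpp
// Hypothetical sketch of capping outstanding thread requests at the
// number of queued work items instead of at the processor count.
#include <atomic>

void WakeOneWorker(); // assumed: releases one waiting worker (e.g. a semaphore)

std::atomic<int> g_queuedItems{0};
std::atomic<int> g_outstandingRequests{0};

void OnWorkItemEnqueued()
{
    int queued = g_queuedItems.fetch_add(1) + 1;
    int requests = g_outstandingRequests.load();
    // Only ask for another worker if fewer workers have been requested
    // than there are items for them to find.
    while (requests < queued &&
           !g_outstandingRequests.compare_exchange_weak(requests, requests + 1))
    {
        // compare_exchange_weak reloads 'requests' on failure
    }
    if (requests < queued)
        WakeOneWorker(); // our CAS won, so a wake-up is owed
}

void OnWorkItemDequeued()
{
    g_queuedItems.fetch_sub(1);
    g_outstandingRequests.fetch_sub(1); // this request has been satisfied
}
```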
dotnet/coreclr#13007 is the fix I drafted for this.
@benaadams
> @sdmaclea Is most contention in the scenario where there aren't enough work items for the threads; so most of the time is spent in a fruitless search?
Yes. dotnet/coreclr#13274 is my simplistic answer for this. I started with something like your gist above, but evolved it to this as it was lighter weight. If we were willing to count queued items, we could do better.
I'm assuming @sdmaclea's fix has solved this; closing in favor of https://github.com/dotnet/coreclr/issues/14017 to complete the work.
On our 48-core machine we are not able to push above roughly 90% CPU utilization in some simple ASP.NET Core scenarios (TechEmpower plaintext). We saturate the CPU, with correspondingly better throughput, on Windows. During the investigation we've had a few theories but no concrete answers yet: