System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() hotspot #44449

Rodrigo-Andrade opened this issue 3 years ago (status: Open)

Rodrigo-Andrade commented 3 years ago

This is more of a loose question; feel free to close it if it's too off-topic.

For some time now I have consistently seen System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() as a hotspot when profiling my code (a reverse proxy; we do plan to move to YARP once it's more mature).

It's always something like:

[image: profiler screenshot of the hotspot]

I see this on an 8-core VM, both on Windows and Linux, with no thread starvation or anything of the sort. Since that method seems to call into native code, I have no clue why this overhead is so big.

Any insight on the matter will, as always, be greatly appreciated.

EgorBo commented 3 years ago

Most likely unrelated, but I noticed that EnsureThreadRequested has a few minor codegen issues; here is a simplified repro:

private int field;

bool EnsureThreadRequested(int count)
{
    int prev = Interlocked.CompareExchange(ref field, count + 1, count);
    return prev == count;
}

Current codegen (RyuJIT, Windows x64):

cmp      dword ptr [rcx], ecx  ; <-- redundant nullcheck, see #44087
add      rcx, 8
lea      r8d, [rdx+1]
mov      eax, edx
lock     
cmpxchg  dword ptr [rcx], r8d ; <-- cmpxchg modifies EFLAGS so we can get rid of the following cmp
cmp      eax, edx
sete     al
movzx    rax, al
ret 

Codegen on Mono-LLVM (Linux x64 ABI, AT&T):

push   %rax
mov    %esi,%eax
lea    0x1(%rax),%ecx
lock cmpxchg %ecx,0x10(%rdi)
sete   %al
pop    %rcx
retq

mangod9 commented 3 years ago

fyi @kouvel

kouvel commented 3 years ago

In short, bursty workloads the thread pool currently releases too many threads too quickly. It's a known issue; fixing it would involve some tradeoffs, but it would probably be a better tradeoff overall. I'm hoping to look into this for .NET 6.
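
For context, here is a rough sketch (not the actual runtime source; the class and helper names are illustrative) of why EnsureThreadRequested sits on the enqueue hot path: every work item enqueued to the pool requests another worker until the number of outstanding thread requests reaches the processor count, so short bursts of work keep hitting the comparatively expensive RequestWorkerThread path:

using System;
using System.Threading;

class ThreadPoolSketch
{
    private int numOutstandingThreadRequests;

    // Called on every enqueue. In a short, bursty workload most enqueues find the
    // request count below ProcessorCount and end up releasing another worker.
    public void EnsureThreadRequested()
    {
        int count = numOutstandingThreadRequests;
        while (count < Environment.ProcessorCount)
        {
            int prev = Interlocked.CompareExchange(ref numOutstandingThreadRequests, count + 1, count);
            if (prev == count)
            {
                RequestWorkerThread(); // the slower "wake/release a worker" path
                break;
            }
            count = prev; // lost the race with another enqueuer; retry with the observed value
        }
    }

    private static void RequestWorkerThread()
    {
        // In the real thread pool this calls into the runtime to release a worker thread;
        // that transition is the cost discussed in this issue.
    }
}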

Rodrigo-Andrade commented 3 years ago

@kouvel any workarounds for now?

kouvel commented 3 years ago

Are you seeing this in 3.1 or 5? In .NET 5 there has been some improvement to time spent there in async socket operations, though it doesn't fix the root cause. I'm not aware of any other workarounds that would reduce the CPU time there.

Rodrigo-Andrade commented 3 years ago

Are you seeing this in 3.1 or 5?

It's .NET 5.

Using the DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS environment variable and UnsafePreferInlineScheduling on the Kestrel transport greatly alleviated this (it doubled the app's performance). I guess it's because this makes the app enqueue to the thread pool much less. But it's a bit surprising that this overhead has such a great impact (on my app, at least).
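
For reference, a minimal sketch of how those two knobs can be wired up, assuming DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1 is set in the environment before the process starts (Linux only) and that the Kestrel sockets transport exposes UnsafePreferInlineScheduling on SocketTransportOptions (this is the shape of the option in later releases; the exact surface on 5.0 may differ). Startup stands in for the app's existing startup class:

using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;

public static class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                webBuilder
                    .UseKestrel()
                    // Run socket continuations inline on the event-handling threads instead of
                    // dispatching them to the thread pool; "unsafe" because blocking in request
                    // processing can stall I/O for other connections.
                    .UseSockets(options => options.UnsafePreferInlineScheduling = true)
                    .UseStartup<Startup>(); // Startup: the app's usual startup class (assumed)
            })
            .Build()
            .Run();
}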

kouvel commented 3 years ago

I see, yeah, that would probably use the thread pool a lot less. The overhead probably means that the thread pool queue is staying relatively short, such that the active worker thread count remains below the processor count, and since the thread pool currently awakens more threads than it needs to, it ends up taking the slower path to wake up a thread more frequently. The fix I have in mind would reduce that overhead, but it wouldn't eliminate it.

Are you seeing better performance with those config options too, or is it just a CPU time issue?

Rodrigo-Andrade commented 3 years ago

When using dotnet-counters, the thread count on the pool was a stable 8 (on an 8-vcore machine), seldom going to 9, unless the threads were going up and down so fast that the counter was not picking it up.
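
As a cross-check (a sketch, not something from the original report), the public ThreadPool properties can be sampled more often than dotnet-counters' roughly once-per-second refresh, which helps catch threads that go up and down quickly:

using System;
using System.Threading;
using System.Threading.Tasks;

static class ThreadPoolSampler
{
    public static async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // ThreadCount is the number of pool threads that currently exist,
            // not the number actively running work.
            Console.WriteLine(
                $"threads={ThreadPool.ThreadCount} " +
                $"queued={ThreadPool.PendingWorkItemCount} " +
                $"completed={ThreadPool.CompletedWorkItemCount}");
            await Task.Delay(100, token); // sample ~10x faster than the default counters interval
        }
    }
}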

Are you seeing better performance with those config options too, or is it just a CPU time issue?

It doubled throughput; I am running with less than half the machines now.

I had some stalling issues, but I will probably report this on another issue.

kouvel commented 3 years ago

The thread count in counters is just the number of threads that have been created; there may be fewer that are active on average.

It doubled throughput; I am running with less than half the machines now.

I see. I'm curious how close the prototype fix would get. Would you be able to try out a prototype fix? If so I can send you a patched runtime and some instructions on using it.

Rodrigo-Andrade commented 3 years ago

Sure!

kouvel commented 3 years ago

Which OS would you prefer for testing it?

kouvel commented 3 years ago

Btw DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS is only applicable to non-Windows platforms. I believe UnsafePreferInlineScheduling would work on Windows too, but I'm guessing the results above were on Linux.

Rodrigo-Andrade commented 3 years ago

It's running on Amazon Linux in a Docker container using the mcr.microsoft.com/dotnet/aspnet:5.0.0 image.

Rodrigo-Andrade commented 3 years ago

@kouvel as we move more apps to .NET 5.0, the overhead is even more pronounced (I guess the changes to SocketAsyncEngine are to blame), making the move from 3.1 to 5 yield no performance improvement unless we use the inline options, but those are dangerous, so we can't enable them on all our apps.

It looks like the main problem is that we run our apps alone (no other apps on the VM) with low CPU usage (15-25%). When we inline everything on the SocketAsyncEngine, performance skyrockets.

Maybe an easy fix would be to be able to specify the minimum thread count? Since it's one app per VM, having a minimum of one thread per core would be more efficient (I know that even then, fewer threads would be better for cache locality, but the overhead of waking up the threads offsets this by a big margin).

kouvel commented 3 years ago

I'm about to build a prototype fix for you to try (sorry for the delay); a couple of questions about that:

Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?

we run our apps alone (no other apps on the VM) with low CPU usage (15-25%)
Maybe an easy fix would be to be able to specify the minimum thread count?

I see. How many procs are assigned to the VM/container? It's possible to configure the thread pool's min worker thread count to lower than the proc count. Set the following env vars before starting the app in the same shell:

export COMPlus_ThreadPool_ForceMinWorkerThreads=1
export COMPlus_HillClimbing_Disable=1

I'm suggesting disabling hill climbing as well, since it doesn't work very well at low thread counts at the moment. Also, if it matters, note that the thread count specified above is treated as hexadecimal.
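
As an aside (not from the thread): because the value is parsed as hexadecimal, ForceMinWorkerThreads=10 would mean 16 threads, so keep single-digit values for small counts. A rough in-process alternative sketch: ThreadPool.SetMinThreads can also lower the minimum worker-thread count below the processor count, although it is not equivalent to the forced value above, and hill climbing can still add threads on top of it unless it is disabled as shown:

using System.Threading;

static class ThreadPoolConfig
{
    // Call early at startup, before the pool has ramped up.
    public static void UseSingleMinWorkerThread()
    {
        ThreadPool.GetMinThreads(out _, out int minIocp); // keep the current IOCP minimum
        ThreadPool.SetMinThreads(1, minIocp);             // request a minimum of one worker thread
    }
}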

Rodrigo-Andrade commented 3 years ago

Which processor architecture are you using, x64/arm/arm64?

x64

For the app are you using a self-contained publish, or using the shared framework from the container image?

shared framework from the container image

Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?

For cpu intensive apps it's about the same, for IO intensive, its worse.

How many procs are assigned to the VM/container?

Just the dotnet process

It's possible to configure the thread pool's min worker thread count to lower than the proc count.

Very nice! I'll try those new settings.

kouvel commented 3 years ago

Thanks @Rodrigo-Andrade,

How many procs are assigned to the VM/container?

I meant how many processors are assigned to the VM/container? Wondering if it is already limited to a few processors or if it has several available and only the CPU utilization is limited.

Rodrigo-Andrade commented 3 years ago

8 vcpus

kouvel commented 3 years ago

Here is a link to a patched System.Private.CoreLib.dll with a prototype fix: RequestFewerThreads50

To try it out, since you're using a container, it would probably be easiest to replace that file in the shared framework via a temporary modification of the image. Find the path to dotnet in the container image; the original file should be in <pathToDotnet>/shared/Microsoft.NETCore.App/5.0.0. Replace that file and try out your scenarios. Also, make sure to revert the changes afterwards to restore the original container image, since this is just a locally built patch and would not be supported.

You can try with and without the config vars above, though the prototype fix may not help much with those config vars set. I'm hoping to at least see some reduction in the overhead in and under EnsureThreadRequested; I'm not sure whether perf would also increase, though. Let me know!

Rodrigo-Andrade commented 3 years ago

I'll play with it this weekend, thank you!

Rodrigo-Andrade commented 3 years ago

@kouvel just a preliminary result: [image: CPU usage graph comparing the runs described below]

The orange line is the current app: .NET 5, not patched, using DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS and Kestrel DangerousInlinePipeScheduling, which avoids most ThreadPool usage (15%-18% CPU usage).

The blue line is the experiment. The first mound is your patch without any parameters (40%-44% CPU usage). The second mound is unpatched without any parameters (stable at 50% CPU usage).

They all handle the same requests per second.

So it's an improvement.

I'll try to get some traces soon.

kouvel commented 3 years ago

Thanks @Rodrigo-Andrade! It's a bit difficult to compare the blue line to the orange one because the work done is a fair bit different: the inline-completions mode bypasses pipelining and a bunch of other work (which is what makes it unreliable), so there's more work being done in the blue line. Still, it looks like there is room for improvement; I'm curious to see how much the overhead under EnsureThreadRequested has decreased with the patch.

You might also want to try with export COMPlus_HillClimbing_Disable=1 and without configuring the worker thread counts. Hill climbing can add a lot of threads unnecessarily and frequently, oversaturating the container; disabling it might have a noticeable effect in CPU-limited cases.

Rodrigo-Andrade commented 3 years ago

Sorry for taking so long.

I tested COMPlus_HillClimbing_Disable and it did not make a difference.

Here are some numbers that I got:

[image: profiler trace]

patched: [image: profiler trace]

The impact in production is greater than this trace suggests.

kouvel commented 3 years ago

Thanks @Rodrigo-Andrade. Since after the patch RequestWorkerThread would typically only be called 2 times per batch of work, it looks like the workload is very bursty with short bursts and probably the additional path length of going through the thread pool and pipelining is significant relative to the other work being done. There may be ways to optimize those paths better, especially for CPU-constrained systems but that would probably be investigation for 6.0.