Rodrigo-Andrade opened this issue 3 years ago
Most likely unrelated, but I noticed that EnsureThreadRequested
has a few minor codegen issues, here is a simplified repro:
private int field;
bool EnsureThreadRequested(int count)
{
int prev = Interlocked.CompareExchange(ref field, count + 1, count);
return prev == count;
}
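For readers following along, the repro can be exercised as a standalone program; the `Demo`/`Main` scaffolding here is mine, but the method body is the one above:

```csharp
using System;
using System.Threading;

class Demo
{
    private static int field;

    // Mirrors the repro above: attempt to advance `field` from `count` to `count + 1`.
    static bool EnsureThreadRequested(int count)
    {
        int prev = Interlocked.CompareExchange(ref field, count + 1, count);
        return prev == count; // true only if the CAS actually took effect
    }

    static void Main()
    {
        Console.WriteLine(EnsureThreadRequested(0)); // succeeds: field was 0, becomes 1
        Console.WriteLine(EnsureThreadRequested(0)); // fails: field is now 1, not 0
        Console.WriteLine(field);
    }
}
```

The `prev == count` comparison at the end is exactly the `cmp eax, edx` that the issue points out is redundant, since `lock cmpxchg` already sets ZF.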
Current codegen (RyuJIT, Windows x64):
cmp dword ptr [rcx], ecx ; <-- redundant nullcheck, see #44087
add rcx, 8
lea r8d, [rdx+1]
mov eax, edx
lock
cmpxchg dword ptr [rcx], r8d ; <-- cmpxchg modifies EFLAGS so we can get rid of the following cmp
cmp eax, edx
sete al
movzx rax, al
ret
Codegen on Mono-LLVM (Linux x64 ABI, AT&T):
push %rax
mov %esi,%eax
lea 0x1(%rax),%ecx
lock cmpxchg %ecx,0x10(%rdi)
sete %al
pop %rcx
retq
fyi @kouvel
In short bursty workloads the thread pool currently releases too many threads too quickly. It's a known issue; fixing it would involve some tradeoffs, but would probably be a net improvement. I'm hoping to look into this for .NET 6.
@kouvel any workarounds for now?
Are you seeing this in 3.1 or 5? In .NET 5 there has been some improvement to time spent there in async socket operations, though it doesn't fix the root cause. I'm not aware of any other workarounds that would reduce the CPU time there.
Are you seeing this in 3.1 or 5?
It's .NET 5.
Using the DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS env var and UnsafePreferInlineScheduling on the Kestrel transport greatly alleviated this (doubled the app's performance). I guess it's because this makes the app enqueue to the thread pool much less. But it's a bit surprising that this overhead has such a great impact (on my app at least).
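For anyone else landing here, these two switches live in different places; a sketch, assuming the option names as of .NET 5 (verify against your ASP.NET Core version):

```shell
# Non-Windows only: run socket completions inline on the event-handling
# threads instead of dispatching them to the thread pool.
export DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1
```

UnsafePreferInlineScheduling, by contrast, is a property on Kestrel's SocketTransportOptions and is set in code rather than via an environment variable. As the names suggest, both options trade safety (user code can block the event-handling threads) for throughput.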
I see, yea that would probably use the thread pool a lot less. The overhead probably means that the thread pool queue is remaining a relatively short length such that the active worker thread count remains below the processor count, and since the thread pool awakens more threads currently, it ends up taking the slower path to wake up a thread more frequently. The fix I have in mind would reduce that overhead but it wouldn't eliminate it.
Are you seeing better performance with those config options too, or is it just a CPU time issue?
When using dotnet-counters, the thread count on the pool was a stable 8 (on an 8-vcore machine), seldom going to 9, unless the threads were going up and down so fast that the counter was not picking this up.
Are you seeing better performance with those config options too, or is it just a CPU time issue?
It doubled throughput; I am running with fewer than half the machines now.
I had some stalling issues, but I will probably report this on another issue.
The thread count in counters is just the number of threads that have been created, there may be fewer that are active on average.
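The counter being read above comes from dotnet-counters' System.Runtime provider; a typical invocation looks like this (the PID is a placeholder):

```shell
# Monitor runtime counters, including "ThreadPool Thread Count",
# for a live process.
dotnet-counters monitor --process-id <pid> System.Runtime
```

As noted, that counter reflects threads created, not threads actively running work at any given moment.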
It doubled throughput; I am running with fewer than half the machines now.
I see. I'm curious how close the prototype fix would get. Would you be able to try out a prototype fix? If so I can send you a patched runtime and some instructions on using it.
Sure!
Which OS would you prefer for testing it?
Btw DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS
is only applicable to non-Windows platforms. I believe UnsafePreferInlineScheduling
would work on Windows too, but I'm guessing the results above were on Linux.
It's running on Amazon Linux in a Docker container using the mcr.microsoft.com/dotnet/aspnet:5.0.0 image.
@kouvel as we are moving more apps to net5.0 the overhead is even more pronounced (I guess the changes to SocketAsyncEngine are to blame), making the move from 3.1 to 5 yield no performance improvement unless we use the inline options. But those are dangerous, so we can't enable them on all our apps.
Looks like the main problem is that we run our apps alone (no other apps on the VM) with low CPU (15-25%). When we inline everything on the SocketAsyncEngine, the performance skyrockets.
Maybe an easy fix would be the ability to specify the minimum thread number? Since it's one app per VM, having a minimum of one thread per core would be more efficient (I know that even then, fewer threads would be better for cache locality, but the overhead of waking up the threads offsets this by a big margin).
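As a side note, there is a public API for adjusting the pool's floor, though it only raises a minimum and cannot cap the thread count the way the runtime env vars discussed below can; a minimal sketch:

```csharp
using System;
using System.Threading;

class MinThreadsDemo
{
    static void Main()
    {
        // SetMinThreads raises the floor below which thread injection is not
        // throttled; it does not limit how many threads the pool may create,
        // so it is not a substitute for ForceMinWorkerThreads.
        bool ok = ThreadPool.SetMinThreads(workerThreads: 2, completionPortThreads: 2);
        ThreadPool.GetMinThreads(out int worker, out int io);
        Console.WriteLine($"{ok} {worker} {io}");
    }
}
```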
I'm about to build a prototype fix for you to try (sorry for the delay), couple of questions about that:
Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?
we run our apps alone (no other apps on the VM) with low cpu (15-25%) Maybe an easy fix was to be able to specify the minimum thread number?
I see. How many procs are assigned to the VM/container? It's possible to configure the thread pool's min worker thread count to lower than the proc count. Set the following env vars before starting the app in the same shell:
export COMPlus_ThreadPool_ForceMinWorkerThreads=1
export COMPlus_HillClimbing_Disable=1
Disabling hill climbing as well since it doesn't work very well at low thread counts at the moment. Also if it matters note that the thread count specified above is treated as hexadecimal.
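Because the value is parsed as hexadecimal, a base-10 intuition can be off; concretely:

```shell
# Parsed as hex: "10" means 16 minimum worker threads, not 10.
export COMPlus_ThreadPool_ForceMinWorkerThreads=10   # 0x10 = 16
# Use "a" for 10 threads, "8" for 8, and so on.
export COMPlus_HillClimbing_Disable=1
```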
Which processor architecture are you using, x64/arm/arm64?
x64
For the app are you using a self-contained publish, or using the shared framework from the container image?
shared framework from the container image
Did I understand correctly that the overhead has increased but the perf has not regressed (or improved) in 5.0 compared to 3.1?
For CPU-intensive apps it's about the same; for IO-intensive apps, it's worse.
How many procs are assigned to the VM/container?
Just the dotnet process
It's possible to configure the thread pool's min worker thread count to lower than the proc count.
Very nice! I'll try those new settings.
Thanks @Rodrigo-Andrade,
How many procs are assigned to the VM/container?
I meant how many processors are assigned to the VM/container? Wondering if it is already limited to a few processors or if it has several available and only the CPU utilization is limited.
8 vcpus
Here is a link to a patched System.Private.CoreLib.dll with a prototype fix: RequestFewerThreads50
To try it out, since you're using a container it probably would be easiest to temporarily replace that file in the shared framework in a temporary modification of the image. Find the path to dotnet
in the container image, the original file should be in <pathToDotnet>/shared/Microsoft.NETCore.App/5.0.0
. Replace that file and try out your scenarios. Also make sure to revert the changes to restore the original container image, since this is just a locally built patch and would not be supported.
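One way to do that temporary swap, assuming the default install path inside the aspnet image (the path and container name here are my assumptions; adjust to your setup):

```shell
# Copy the patched assembly into a running container. Revert by recreating
# the container from the original image when done, since this patch is
# unsupported.
docker cp System.Private.CoreLib.dll \
  <container>:/usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.0/System.Private.CoreLib.dll
```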
You can try with and without the config vars above, though the prototype fix may not help much with the above config vars set. I'm hoping to at least see some reduction in the overhead in and under EnsureThreadRequested
, not sure if perf would also increase though. Let me know!
I'll play with it this weekend, thank you!
@kouvel just a preliminary result:
The orange line is the current app, dotnet 5 not patched using DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS
and kestrel DangerousInlinePipeScheduling
which avoids most ThreadPool
usage (15%-18% cpu usage).
The blue line is the experiment. The first mound is your patch without any parameters (40%-44% cpu usage). The second mound is the unpatched runtime without any parameters (stable at 50% cpu usage).
They all have the same request/s.
So it's an improvement.
I'll try to get some traces soon.
Thanks @Rodrigo-Andrade! It's a bit difficult to compare the blue to the orange line because the amount of work done is fairly different: the inline-completions mode bypasses pipelining and a bunch of other work (which is what makes it unreliable), so there's more work being done in the blue line. Still, it looks like there is room for improvement; curious to see how much the overhead under EnsureThreadRequested has decreased with the patch.
Might also want to try with export COMPlus_HillClimbing_Disable=1
and without configuring the worker thread counts. Hill climbing can add a lot of threads unnecessarily and frequently, oversaturating the container; disabling it might have a noticeable effect in CPU-limited cases.
Sorry for taking so long.
Tested COMPlus_HillClimbing_Disable and it did not make a difference.
Here are some numbers I got:
patched:
The impact in production is greater than this trace suggests.
Thanks @Rodrigo-Andrade. Since after the patch RequestWorkerThread
would typically only be called 2 times per batch of work, it looks like the workload is very bursty with short bursts and probably the additional path length of going through the thread pool and pipelining is significant relative to the other work being done. There may be ways to optimize those paths better, especially for CPU-constrained systems but that would probably be investigation for 6.0.
This is more of a loose question, so feel free to close it if it's too off-topic.
For some time now I have seen
System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested()
as a hotspot when profiling my code (a reverse proxy; we do plan to move to YARP once it's more mature). It's always something like:
I see this on an 8-core VM, both on Windows and Linux. No thread starvation or anything of the sort. Since that method seems to call into native code, I have no clue why this overhead is so big.
Anything on the matter will, as always, be greatly appreciated.
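To see where the cost comes from, here is a simplified, self-contained model of the method's shape (class names and scaffolding are mine; the real method lives in System.Private.CoreLib and the counter increment stands in for the native RequestWorkerThread call): every enqueue pays for at least one interlocked operation until the outstanding-request count saturates at the processor count, so short bursty workloads hit this CAS loop constantly.

```csharp
using System;
using System.Threading;

// Hypothetical simplified model of ThreadPoolWorkQueue.EnsureThreadRequested.
class ThreadRequestModel
{
    private int numOutstandingThreadRequests;
    private readonly int procCount;
    public int RequestsIssued; // how often the "slow path" was taken

    public ThreadRequestModel(int procCount) => this.procCount = procCount;

    public void EnsureThreadRequested()
    {
        // CAS loop: request at most one worker thread per outstanding
        // request slot, capped at the processor count.
        int count = numOutstandingThreadRequests;
        while (count < procCount)
        {
            int prev = Interlocked.CompareExchange(
                ref numOutstandingThreadRequests, count + 1, count);
            if (prev == count)
            {
                RequestsIssued++; // stands in for RequestWorkerThread()
                break;
            }
            count = prev; // another thread raced us; retry with its value
        }
    }

    static void Main()
    {
        var model = new ThreadRequestModel(procCount: 4);
        for (int i = 0; i < 100; i++)
            model.EnsureThreadRequested();
        Console.WriteLine(model.RequestsIssued); // capped at procCount
    }
}
```

Note that even after the count saturates, the model (like the real method) still reads the shared counter on every call, which is where the per-enqueue overhead in the profiles above would come from.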