Closed amrmahdi closed 4 years ago
Existing code is using ProcessorCount as what @brendandburns called max_threads, which is a more useful number than being able to get the number of physical cores (especially in a container).
Sorry ... I wasn't clear. This change is a proposal, not a final plan. When I get back from holidays, I'm going to write up and publish a short paper (probably just a table) that outlines two or more proposals.
I have been pushing the idea of not reverting the change. This has turned out to be a good idea, because it forced an investigation that gave us a lot of information and insight. Having that gives us the right vantage point to decide what to do for both 3.1 and 5.0. This paper will make recommendations for both releases.
I have just updated to .NET Core 3.1.0 to test behavior on our environment and what I have found out confuses me.
AFAICT the CPU clipping is still present. Can anyone clear this up for me please? @VSadov
Here are .NET ThreadPool stats and processor count.
This is run on Kubernetes; the worker is a 4-core machine (with hyper-threading, so 8 CPUs) running CentOS 7. On the left side I have no CPU limit set in K8s; on the right side I have set it to 2000m.
Source code for the metric above:
```csharp
ThreadPool.GetMinThreads(out var minWorkerThreads, out var minCompletionPortThreads);
ThreadPool.GetMaxThreads(out var maxWorkerThreads, out var maxCompletionPortThreads);
ThreadPool.GetAvailableThreads(out var availableWorkerThreads, out var availableCompletionPortThreads);

ObserveThreadPoolStats(
    minWorkerThreads,
    minCompletionPortThreads,
    maxWorkerThreads - availableWorkerThreads,
    maxCompletionPortThreads - availableCompletionPortThreads,
    Environment.ProcessorCount);
```
@mrmartan - I am not sure I understand what is shown on the picture and which part is unexpected.
A few things to consider:
`minWorkerThreads` can be dynamically adjusted higher (unless set explicitly by the user), based on throughput feedback. It does not mean that the actual number of active workers will immediately increase.

Processor count dropping to 2: I believed the whole point of this issue is that 3.1 reports the actual CPU count of the machine, i.e. that it is no longer affected by the `cpu,cpuacct` cgroup settings `cpu.cfs_period_us` and `cpu.cfs_quota_us` as it was on 3.0. What I see above is 3.0 behavior.
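For context, the quota-based clipping being discussed can be approximated by reading those cgroup v1 files directly. This is my own illustrative sketch, not the runtime's actual implementation; the mount point and rounding behavior are assumptions:

```csharp
using System;
using System.IO;

static class CgroupCpu
{
    // Pure helper: derive an effective CPU count from a CFS quota/period pair.
    // quotaUs == -1 means "no limit" in cgroup v1.
    public static int CpuCountFromQuota(long quotaUs, long periodUs)
    {
        if (quotaUs <= 0 || periodUs <= 0)
            return Environment.ProcessorCount;
        // Round up: e.g. a 1500m limit behaves like 2 schedulable CPUs.
        return (int)Math.Max(1, (quotaUs + periodUs - 1) / periodUs);
    }

    // Reads the cgroup v1 files; falls back to ProcessorCount when unavailable.
    public static int ReadEffectiveCpuCount()
    {
        const string dir = "/sys/fs/cgroup/cpu"; // assumed cgroup v1 mount point
        try
        {
            long quota = long.Parse(File.ReadAllText(Path.Combine(dir, "cpu.cfs_quota_us")).Trim());
            long period = long.Parse(File.ReadAllText(Path.Combine(dir, "cpu.cfs_period_us")).Trim());
            return CpuCountFromQuota(quota, period);
        }
        catch (Exception)
        {
            return Environment.ProcessorCount;
        }
    }

    static void Main()
    {
        Console.WriteLine($"Effective CPU count: {ReadEffectiveCpuCount()}");
    }
}
```

With a K8s limit of 2000m and the default 100ms period, this yields 2; with no limit (`cpu.cfs_quota_us` = -1) it falls back to the machine count.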
I had a lot of trouble reading that image. I think I now understand it. Is that dotted line the CPU count? I assume the tile that is being shown is showing data from the left-hand side.
ProcessorCount should not be affected by quota. It can be affected by changing affinity. Just to rule that out: is it possible that an affinity mask is being set in your scenario?
@richlander yes, the red line is processor count. It is overlapping with the purple line, so it's dotted. I am sorry about that; I assumed it was easy to read.
@VSadov I can try and look into that. I am learning now that a lot depends on the actual versions of Kubernetes, Docker and Linux (especially CentOS/RHEL seem to have quite a lot of specifics). All I am setting myself, though, are k8s CPU limits and requests.
So, first off, I have to apologize. What I wrote in https://github.com/dotnet/runtime/issues/622#issuecomment-572643790 was my bad. A colleague of mine overwrote my deployment with a 3.0 build.
On the other hand I made sure all my subsequent tests were correct and that all the pods are being scheduled on the same k8s worker (with 16 CPUs). I have made some observations with limiting CPU on 3.1 and while none of it was done in production under production loads I'd say 3.1 is NOT breaking for us.
What's of particular interest, though, is that the 3.0 build has better performance than 3.1. The following graphs show the same application, same workload, same configuration, same worker; the only difference is that one build (left side) is on dotnet 3.1 and the other (right side) is on dotnet 3.0. (Don't let the colors on the POD CPU usage graph confuse you; they do not match the other graphs.)
Thanks for validating!
Could you try this same exercise with the build that I shared in December? https://hub.docker.com/r/richlander/aspnet. It has a fix in it to the thread pool that aligns minthreads with CPU quota. We believe that this reduces contention, and have seen that it collapses most of the regression.
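I don't know exactly what that build does internally, but "aligning minthreads with CPU quota" can be sketched roughly like this. The `cpuQuota` value here is a hard-coded stand-in for whatever the runtime would derive from the cgroup quota files:

```csharp
using System;
using System.Threading;

class AlignMinThreads
{
    static void Main()
    {
        // Stand-in for a quota derived from cpu.cfs_quota_us / cpu.cfs_period_us;
        // hard-coded here purely for illustration.
        int cpuQuota = 2;

        ThreadPool.GetMinThreads(out _, out int minIo);
        // Cap the pool's minimum worker threads at the quota instead of
        // letting it default to the machine-wide Environment.ProcessorCount.
        ThreadPool.SetMinThreads(Math.Min(cpuQuota, Environment.ProcessorCount), minIo);

        ThreadPool.GetMinThreads(out int minWorker, out _);
        Console.WriteLine($"min worker threads: {minWorker}");
    }
}
```

The idea is that a lower floor of eagerly-injected workers reduces contention when the container can only actually schedule a couple of CPUs' worth of work.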
@mrmartan your observations match what was said. ProcessorCount in 3.1 no longer takes cpu quota into account. And like @amrmahdi, you're seeing reduced performance.
The breaking part of the issue is that ProcessorCount now returns something different, and that it affects performance.
@richlander existing code is using ProcessorCount as an indicator for parallelism. That's why we see these performance regressions. Even if you fix .NET Core, other code will regress too. Benchmarks of @mrmartan and @amrmahdi show that taking cpu quota into account makes a better heuristic for parallelism.
Even if the new implementation is more correct, having an indicator of parallelism is far more useful.
@richlander existing code is using ProcessorCount as an indicator for parallelism. That's why we see these performance regressions.
Our internal investigation has not been able to demonstrate this. It's the obvious conclusion, but we are yet to prove that.
Current status:

- There are two main problems that have been observed: high memory costs due to fixed per GC-heap costs, and throughput regressions.
- For reducing memory costs, we recommend configuring the GC by setting `COMPlus_GCHeapCount` to a low value, likely matching the CPU quota, or using workstation GC by setting `COMPlus_gcserver` to `0`. FYI: This is not our long-term plan.

I know people are passionate about this. Our challenge is that we have not seen data that clearly sets a direction. If people want to work closely and/or privately with us with real world apps, we are interested and motivated. We're also not closing the door to any solution, but we do require good data to make changes. We also took a hard run at epoll threads. @stephentoub then reminded us that you had already done that @ https://github.com/dotnet/corefx/pull/36693.
I know that the ProcessorCount change was a break, but it's only a break since 3.0 so it isn't particularly convincing in-and-of itself.
We DEFINITELY want to do the RIGHT THING. We're trying very hard to determine what that is.
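Not from the thread, but a quick way to verify what those `COMPlus_*` knobs actually did to a given process is a snippet like this (a sketch; on 3.x the server GC heap count itself isn't directly exposed, so this only confirms the GC flavor and the processor count it would default from):

```csharp
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // With COMPlus_gcserver=0 this should print False (workstation GC);
        // with server GC, the heap count defaults to Environment.ProcessorCount
        // unless COMPlus_GCHeapCount overrides it.
        Console.WriteLine($"Server GC:       {GCSettings.IsServerGC}");
        Console.WriteLine($"Processor count: {Environment.ProcessorCount}");
    }
}
```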
Our internal investigation has not been able to demonstrate this. It's the obvious conclusion, but we are yet to prove that.
The ProcessorCount related changes you're making seem to confirm this.
I know that the ProcessorCount change was a break, but it's only a break since 3.0 so it isn't particularly convincing in-and-of itself.
Since 3.1.
If people want to work closely and/or privately with us with real world apps, we are interested and motivated.
Are there real world apps that are working better with this change?
@richlander
Could you try this same exercise with the build that I shared in December? https://hub.docker.com/r/richlander/aspnet.
I will try that.
- There are two main problems that have been observed: high memory costs due to fixed per GC-heap costs, throughput regressions.
- For reducing memory costs, we recommend configuring the GC by setting COMPlus_GCHeapCount to a low value, likely matching the CPU quota, or use workstation GC by setting COMPlus_gcserver to 0. FYI: This is not our long-term plan.
That was my impression as well. But my tests do not validate that. I have run them on 3.1 with all combinations of CPU quota (no limit and 2000m) and `GCHeapCount` (2 and not set). In all cases limiting the heap count led to worse performance, while with both unlimited, memory consumption was barely 10% higher. @tmds This is what I meant by breaking: unlimiting the CPU on .NET Core 2.2 led to roughly quadrupled memory consumption (due to GC heaps) in my particular case.
I also don't think that CPU quota should be the mechanism to govern parallelism (real CPU count/affinity probably should), though there need to be some other means of controlling parallelism in dotnet.

My current aim is to run all applications in Kubernetes without CPU limits and use other means of governance in Kubernetes to manage resources. Whether that's a good idea remains to be seen, but dotnet should not require me to set a CPU quota; IMHO it should be able to run optimally in shared environments without CPU quota or any such measures.
A significant memory regression will only (per our understanding) repro in environments with large core counts. How many cores are in your test environment?
Yes, please test that build. Your findings will be super interesting.
We DEFINITELY want to do the RIGHT THING.
@richlander this is my view on 'right'
Agreed. But that's just part of it, and to some degree the easy part. The crux of our challenge and investigation has been determining what the best default behavior should be for the runtime, and to a lesser degree the class library, in this mode.
Agreed. But that's just part of it, and to some degree the easy part. The crux of our challenge and investigation has been determining what the best default behavior should be for the runtime, and to a lesser degree the class library, in this mode.
To me it looks like you're trying to find out what needs to change in the runtime to compensate for ProcessorCount no longer taking into account cpu quota. Ideally, runtime and class library use the same 'recommended level of parallelism'.
The importance of taking into account cpu quota is because it does have an effect on parallelism. Especially on containers with low cpu quota, which is a common configuration in all production Kubernetes deployments I've seen.
For 3.1, it makes sense to revert https://github.com/dotnet/coreclr/pull/26153 imo. There were no issues reported for it, and it has worked well for 2.x and 3.0.
For .NET 5 we can see if we can relax the impact of cpu quota and how that improves performance. And maybe add some new APIs.
2.x and 3.1 have the same behavior. The CPU quota clipping was in 3.0 only, and several customers reported it as a problem when upgrading from 2.x. That's why we have reverted it in 3.1.
The CPU quota clipping was in 3.0 only
Isn't this clipping in 2.1: https://github.com/dotnet/coreclr/blob/release/2.1/src/classlibnative/bcltype/system.cpp#L337-L342
@richlander That private build of yours seems to be indeed better. Can we expect a 3.1.x (even prerelease) build with this soon?
At 13:05 I switched from `dotnet/core/aspnet:3.0.1` to `richlander/aspnet:latest`.
The dips in RPM around 13:30 and 13:45 were caused by the environment.
Thread count is the number of process threads as reported by the OS (i.e. it is not a dotnet metric).
EDIT: It even has a better memory profile (I do not control Heap count in these tests).
Isn't this clipping in 2.1: https://github.com/dotnet/coreclr/blob/release/2.1/src/classlibnative/bcltype/system.cpp#L337-L342
I checked with Microsoft .NET Core docker images. 2.x is clipping (like 3.0).
Isn't this clipping in 2.1:
You are right. Thanks for pointing it out. I was under the wrong impression that we did not have the clipping in 2.1 and added it in 3.0. I misread the 3.0 change https://github.com/dotnet/coreclr/pull/23747/. That change added clipping in another part of the system, but not for `Environment.ProcessorCount`. So either this change is causing the regressions that some customers were observing when migrating from 2.x to 3.0, or there is something else we do not understand yet.
This is a good argument for reverting the clipping changes and, as the next step, investigating additional fixes for the issues that people reported on top of the 2.x/3.0 baseline.
The PR for reintroducing the clipping: https://github.com/dotnet/coreclr/pull/27990
I'm glad we got the `ProcessorCount` changes figured out, and can keep 3.1 the same as previous versions.
Good luck hunting for the 2.x/3.0 regression.
Fixed in 3.1.2
FYI: Some folks were wanting more control over the CPU scaling behavior, essentially "please don't clip". We now have another option for you. See: https://github.com/dotnet/runtime/issues/48094. This change will be in .NET 6 Preview 5.
We are trying to upgrade to .NET Core 3.1, but we noticed that `Environment.ProcessorCount` reports different values on .NET Core 3.1. We use the official SDK docker images. I've attached 2 sample projects, 1 for 3.0 and 1 for 3.1.
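The attached sample projects aren't reproduced in this thread, but judging by the output below, their essence is presumably just:

```csharp
using System;

class Program
{
    static void Main()
    {
        // On 3.0, inside a container limited to 1 CPU, this printed 1;
        // on 3.1 it prints the node's full CPU count.
        Console.WriteLine(
            $"The number of processors on this computer is {Environment.ProcessorCount}.");
    }
}
```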
Repro
To repro use the following command to run:
Outcome
3.0:
The number of processors on this computer is 1.

3.1:
The number of processors on this computer is <actual_node_cpus>.
So if a machine has 8 cores, in 3.0 a container assigned 1 core still got an outcome of 1, while in 3.1 the outcome is 8. Is this change by design?
We are also seeing much higher cpu consumption on the 3.1 containers; our initial theory is that the runtime thinks it has more cores than it actually does.
repro.zip