Closed amrmahdi closed 4 years ago
Existing code is using ProcessorCount as what @brendandburns called max_threads, which is a more useful number than being able to get the number of physical cores (especially in a container).
Sorry ... I wasn't clear. This change is a proposal, not a final plan. When I get back from holidays, I'm going to write up and publish a short paper (probably just a table) that outlines two or more proposals.
I have been pushing the idea of not reverting the change. This has turned out to be a good idea, because it forced an investigation that gave us a lot of information and insight. Having that gives us the right vantage point to decide what to do for both 3.1 and 5.0. This paper will make recommendations for both releases.
I have just updated to .NET Core 3.1.0 to test behavior on our environment and what I have found out confuses me.
AFAICT the CPU clipping is still present. Can anyone clear this up for me please? @VSadov
Here are .NET ThreadPool stats and processor count.
This is run on Kubernetes; the worker is a 4-core machine (with hyper-threading, so 8 CPUs) running CentOS 7. On the left side I have no CPU limit set in K8s; on the right side I have set it to 2000m.
Source code for the metric above:
```csharp
ThreadPool.GetMinThreads(out var minWorkerThreads, out var minCompletionPortThreads);
ThreadPool.GetMaxThreads(out var maxWorkerThreads, out var maxCompletionPortThreads);
ThreadPool.GetAvailableThreads(out var availableWorkerThreads, out var availableCompletionPortThreads);

ObserveThreadPoolStats(
    minWorkerThreads,
    minCompletionPortThreads,
    maxWorkerThreads - availableWorkerThreads,
    maxCompletionPortThreads - availableCompletionPortThreads,
    Environment.ProcessorCount);
```
@mrmartan - I am not sure I understand what is shown on the picture and which part is unexpected.
A few things to consider:
`minWorkerThreads` can be dynamically adjusted higher (unless set explicitly by the user), based on throughput feedback. It does not mean that the actual number of active workers will immediately increase.

Processor count dropping to 2: I believed the whole point of this issue is that 3.1 reports the actual CPU count of the machine, i.e. that it is no longer affected by the `cpu,cpuacct` cgroup settings `cpu.cfs_period_us` and `cpu.cfs_quota_us` as it was on 3.0. What I see above is 3.0 behavior.
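For context, the quota-based clipping being discussed can be approximated by reading those cgroup v1 files directly. This is my own illustrative sketch, not the runtime's actual implementation; the mount point and rounding behavior are assumptions:

```csharp
using System;
using System.IO;

static class CgroupCpu
{
    // Pure helper: derive an effective CPU count from a CFS quota/period pair.
    // quotaUs == -1 means "no limit" in cgroup v1.
    public static int CpuCountFromQuota(long quotaUs, long periodUs)
    {
        if (quotaUs <= 0 || periodUs <= 0)
            return Environment.ProcessorCount;
        // Round up: e.g. a 1500m limit behaves like 2 schedulable CPUs.
        return (int)Math.Max(1, (quotaUs + periodUs - 1) / periodUs);
    }

    // Reads the cgroup v1 files; falls back to ProcessorCount when unavailable.
    public static int ReadEffectiveCpuCount()
    {
        const string dir = "/sys/fs/cgroup/cpu"; // assumed cgroup v1 mount point
        try
        {
            long quota = long.Parse(File.ReadAllText(Path.Combine(dir, "cpu.cfs_quota_us")).Trim());
            long period = long.Parse(File.ReadAllText(Path.Combine(dir, "cpu.cfs_period_us")).Trim());
            return CpuCountFromQuota(quota, period);
        }
        catch (Exception)
        {
            return Environment.ProcessorCount;
        }
    }

    static void Main()
    {
        Console.WriteLine($"Effective CPU count: {ReadEffectiveCpuCount()}");
    }
}
```

With a K8s limit of 2000m and the default 100ms period, this yields 2; with no limit (`cpu.cfs_quota_us` = -1) it falls back to the machine count.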
I had a lot of trouble reading that image. I think I now understand it. Is that dotted line the CPU count? I assume the tile that is being shown is showing data from the left-hand side.
ProcessorCount should not be affected by quota. It can be affected by changing affinity. Just to rule that out: is it possible that an affinity mask is being set in your scenario?
@richlander yes, the red line is processor count. It is overlapping with the purple line, so it's dotted. I am sorry about that; I assumed it was easy to read.
@VSadov I can try and look into that. I am learning now that a lot depends on the actual versions of Kubernetes, Docker and Linux (especially CentOS/RHEL seem to have quite a lot of specifics). All I am setting myself, though, are k8s CPU limits and requests.
So, first off, I have to apologize. What I wrote in https://github.com/dotnet/runtime/issues/622#issuecomment-572643790 was my bad. A colleague of mine overwrote my deployment with a 3.0 build.
On the other hand I made sure all my subsequent tests were correct and that all the pods are being scheduled on the same k8s worker (with 16 CPUs). I have made some observations with limiting CPU on 3.1 and while none of it was done in production under production loads I'd say 3.1 is NOT breaking for us.
What's of particular interest, though, is that the 3.0 build has better performance than 3.1. The following graphs show the same application, same workload, same configuration, same worker; the only difference is that one build (left side) is on dotnet 3.1 and the other (right side) is on dotnet 3.0. (Don't let the colors on the POD CPU usage graph confuse you; they do not match the other graphs.)
Thanks for validating!
Could you try this same exercise with the build that I shared in December? https://hub.docker.com/r/richlander/aspnet. It has a fix in it to the thread pool that aligns minthreads with CPU quota. We believe that this reduces contention, and have seen that it collapses most of the regression.
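I don't know exactly what that build does internally, but "aligning minthreads with CPU quota" can be sketched roughly like this. The `cpuQuota` value here is a hard-coded stand-in for whatever the runtime would derive from the cgroup quota files:

```csharp
using System;
using System.Threading;

class AlignMinThreads
{
    static void Main()
    {
        // Stand-in for a quota derived from cpu.cfs_quota_us / cpu.cfs_period_us;
        // hard-coded here purely for illustration.
        int cpuQuota = 2;

        ThreadPool.GetMinThreads(out _, out int minIo);
        // Cap the pool's minimum worker threads at the quota instead of
        // letting it default to the machine-wide Environment.ProcessorCount.
        ThreadPool.SetMinThreads(Math.Min(cpuQuota, Environment.ProcessorCount), minIo);

        ThreadPool.GetMinThreads(out int minWorker, out _);
        Console.WriteLine($"min worker threads: {minWorker}");
    }
}
```

The idea is that a lower floor of eagerly-injected workers reduces contention when the container can only actually schedule a couple of CPUs' worth of work.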
@mrmartan your observations match what was said. ProcessorCount in 3.1 no longer takes cpu quota into account. And like @amrmahdi, you're seeing reduced performance.
The breaking part of the issue is that ProcessorCount now returns something different, and that it affects performance.
@richlander existing code is using ProcessorCount as an indicator for parallelism. That's why we see these performance regressions. Even if you fix .NET Core, other code will regress too. Benchmarks of @mrmartan and @amrmahdi show that taking cpu quota into account makes a better heuristic for parallelism.
Even if the new implementation is more correct, having an indicator of parallelism is far more useful.
@richlander existing code is using ProcessorCount as an indicator for parallelism. That's why we see these performance regressions.
Our internal investigation has not been able to demonstrate this. It's the obvious conclusion, but we are yet to prove that.
Current status:

- There are two main problems that have been observed: high memory costs due to fixed per GC-heap costs, and throughput regressions.
- For reducing memory costs, we recommend configuring the GC by setting `COMPlus_GCHeapCount` to a low value, likely matching the CPU quota, or using workstation GC by setting `COMPlus_gcserver` to `0`. FYI: This is not our long-term plan.

I know people are passionate about this. Our challenge is that we have not seen data that clearly sets a direction. If people want to work closely and/or privately with us with real world apps, we are interested and motivated. We're also not closing the door to any solution, but we do require good data to make changes. We also took a hard run at epoll threads. @stephentoub then reminded us that you had already done that @ https://github.com/dotnet/corefx/pull/36693.
I know that the ProcessorCount change was a break, but it's only a break since 3.0 so it isn't particularly convincing in-and-of itself.
We DEFINITELY want to do the RIGHT THING. We're trying very hard to determine what that is.
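Not from the thread, but a quick way to verify what those `COMPlus_*` knobs actually did to a given process is a snippet like this (a sketch; on 3.x the server GC heap count itself isn't directly exposed, so this only confirms the GC flavor and the processor count it would default from):

```csharp
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // With COMPlus_gcserver=0 this should print False (workstation GC);
        // with server GC, the heap count defaults to Environment.ProcessorCount
        // unless COMPlus_GCHeapCount overrides it.
        Console.WriteLine($"Server GC:       {GCSettings.IsServerGC}");
        Console.WriteLine($"Processor count: {Environment.ProcessorCount}");
    }
}
```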
Our internal investigation has not been able to demonstrate this. It's the obvious conclusion, but we are yet to prove that.
The ProcessorCount related changes you're making seem to confirm this.
I know that the ProcessorCount change was a break, but it's only a break since 3.0 so it isn't particularly convincing in-and-of itself.
Since 3.1.
If people want to work closely and/or privately with us with real world apps, we are interested and motivated.
Are there real world apps that are working better with this change?
@richlander
Could you try this same exercise with the build that I shared in December? https://hub.docker.com/r/richlander/aspnet.
I will try that.
- There are two main problems that have been observed: high memory costs due to fixed per GC-heap costs, throughput regressions.
- For reducing memory costs, we recommend configuring the GC by setting COMPlus_GCHeapCount to a low value, likely matching the CPU quota, or use workstation GC by setting COMPlus_gcserver to 0. FYI: This is not our long-term plan.
That was my impression as well. But my tests do not validate that. I have run them on 3.1 with all combinations of CPU quota (no limit and 2000m) and `GCHeapCount` (2 and not set). In all cases limiting the heap count led to worse performance, while with both unlimited, memory consumption was barely 10% higher. @tmds This is what I meant by breaking: unlimiting the CPU on .NET Core 2.2 led to roughly quadrupled memory consumption (due to GC heaps) in my particular case.
I also don't think that CPU quota should be the mechanism to govern parallelism (real CPU count/affinity probably should), though there need to be some other means of controlling parallelism in dotnet.

My current aim is to run all applications in Kubernetes without CPU limits and use other means of governance in Kubernetes to manage resources. Whether that's a good idea remains to be seen, but dotnet should not require me to set a CPU quota; IMHO it should be able to run optimally in shared environments without CPU quota or any such measures.
A significant memory regression will only (per our understanding) repro in environments with large core counts. How many cores are in your test environment?
Yes, please test that build. Your findings will be super interesting.
We DEFINITELY want to do the RIGHT THING.
@richlander this is my view on 'right'
Agreed. But that's just part of it, and to some degree the easy part. The crux of our challenge and investigation has been determining what the best default behavior should be for the runtime, and to a lesser degree the class library, in this mode.
Agreed. But that's just part of it, and to some degree the easy part. The crux of our challenge and investigation has been determining what the best default behavior should be for the runtime, and to a lesser degree the class library, in this mode.
To me it looks like you're trying to find out what needs to change in the runtime to compensate for ProcessorCount no longer taking into account cpu quota. Ideally, runtime and class library use the same 'recommended level of parallelism'.
The importance of taking into account cpu quota is because it does have an effect on parallelism. Especially on containers with low cpu quota, which is a common configuration in all production Kubernetes deployments I've seen.
For 3.1, it makes sense to revert https://github.com/dotnet/coreclr/pull/26153 imo. There were no issues reported for it, and it has worked well for 2.x and 3.0.
For .NET 5 we can see if we can relax the impact of cpu quota and how that improves performance. And maybe add some new APIs.
2.x and 3.1 have the same behavior. The CPU quota clipping was in 3.0 only, and several customers reported it as a problem when upgrading from 2.x. That's why we have reverted it in 3.1.
The CPU quota clipping was in 3.0 only
Isn't this clipping in 2.1: https://github.com/dotnet/coreclr/blob/release/2.1/src/classlibnative/bcltype/system.cpp#L337-L342
@richlander That private build of yours seems to be indeed better. Can we expect a 3.1.x (even prerelease) build with this soon?
At 13:05 I switched from `dotnet/core/aspnet:3.0.1` to `richlander/aspnet:latest`.
The dips in RPM around 13:30 and 13:45 were caused by the environment.
Thread count is the number of process threads as reported by the OS (i.e. it is not a dotnet metric).
EDIT: It even has a better memory profile (I do not control Heap count in these tests).
Isn't this clipping in 2.1: https://github.com/dotnet/coreclr/blob/release/2.1/src/classlibnative/bcltype/system.cpp#L337-L342
I checked with Microsoft .NET Core docker images. 2.x is clipping (like 3.0).
Isn't this clipping in 2.1:
You are right. Thanks for pointing it out. I was under the wrong impression that we did not have the clipping in 2.1 and added it in 3.0. I misread the 3.0 change https://github.com/dotnet/coreclr/pull/23747/. That change added clipping in another part of the system, but not for `Environment.ProcessorCount`. So either this change is causing the regressions that some customers were observing when migrating from 2.x to 3.0, or there is something else we do not understand yet.
This is a good argument for reverting the clipping changes and, as the next step, investigating additional fixes for the issues that people reported on top of the 2.x/3.0 baseline.
The PR for reintroducing the clipping: https://github.com/dotnet/coreclr/pull/27990
I'm glad we got the `ProcessorCount` changes figured out, and can keep 3.1 the same as previous versions.
Good luck hunting for the 2.x/3.0 regression.
Fixed in 3.1.2
FYI: Some folks were wanting more control over the CPU scaling behavior, essentially "please don't clip". We now have another option for you. See: https://github.com/dotnet/runtime/issues/48094. This change will be in .NET 6 Preview 5.
We are trying to upgrade to .NET Core 3.1, but we noticed that `Environment.ProcessorCount` reports different values on .NET Core 3.1. We use the official SDK docker images. I've attached 2 sample projects, 1 for 3.0 and 1 for 3.1.
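The attached sample projects aren't reproduced in this thread, but judging by the output below, their essence is presumably just:

```csharp
using System;

class Program
{
    static void Main()
    {
        // On 3.0, inside a container limited to 1 CPU, this printed 1;
        // on 3.1 it prints the node's full CPU count.
        Console.WriteLine(
            $"The number of processors on this computer is {Environment.ProcessorCount}.");
    }
}
```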
Repro
To repro use the following command to run:
Outcome
3.0:
The number of processors on this computer is 1.

3.1:
The number of processors on this computer is <actual_node_cpus>.
So if a machine has 8 cores, in 3.0 a container assigned 1 core still got an outcome of 1, while in 3.1 the outcome is 8. Is this change by design?
We are also seeing much higher cpu consumption on the 3.1 containers; our initial theory is that the runtime thinks it has more cores than it actually does.
repro.zip