Environment.ProcessorCount incorrect reporting in containers in 3.1

amrmahdi commented 4 years ago

We are trying to upgrade to .NET Core 3.1. But we that Environment.ProcessorCount reports different values on .NET Core 3.1.

We use the official SDK docker images. I've attached 2 sample projects, 1 for 3.0 and 1 for 3.1

Repro

To repro use the following command to run:

1. docker build . -f Dockerfile 
2. docker run --cpus=1 <image_built_from_#1>

Outcome

3.0

The number of processors on this computer is 1.

3.1

The number of processors on this computer is <actual_node_cpus>.

So if a machine has 8 cores, in 3.0 a container assigned 1 core we still got an outcome of 1, while in 3.1 the outcome is 8.

Is this change by design ?

Were are also seeing much higher cpu consumption on the 3.1 containers, our initial theory that the runtime is thinking it has more cores while it does not.

repro.zip

stephentoub commented 4 years ago

cc: @richlander, @janvorli

janvorli commented 4 years ago

It is by design. The --cpus doesn't limit the number of processors the process can run on. Even with --cpus=1, threads of your process can end up being scheduled on multiple CPUs. See https://docs.docker.com/config/containers/resource_constraints/ for more details. The --cpuset-cpus is the option that sets the specific CPU cores that the container can use.

We used to report the processor count based on the --cpus, but we have realized that in many cases, the Environment.ProcessorCount is used to detect maximum level of parallelism an application can run with and so we have changed it.

janvorli commented 4 years ago

See https://github.com/dotnet/coreclr/issues/26053 for more details.

amrmahdi commented 4 years ago

I see, makes sense. Would this have any impact on the runtime tasks like thread scheduling, GC etc ?

theolivenbaum commented 4 years ago

Just adding my two cents here - got hit hard by this today on updating to netcore3.1 on a 112-core server - where every pod running on this server had CPU limits set on Kubernetes, but now are behaving like they've access to all the 112 cores.

My impression is that decision to change how many cores the runtime considers seems to have been rushed out based on a single-instance benchmark and not on real-world usage of mixed pods in the same server.

Is there any documentation on how to properly set up CPU limits for netcore3.1 under Kubernetes with this change?

tmds commented 4 years ago

In https://github.com/dotnet/coreclr/pull/23398 we discussed this change and did not to merge based on lack of benchmark results.

https://github.com/dotnet/coreclr/pull/26153 then made the change without benchmark results.

stephentoub commented 4 years ago

cc: @VSadov

jkotas commented 4 years ago

benchmark results

We have seen performance results that show improvements both ways. For example, here is a graph that shows significant improvement after the clipping was removed: https://github.com/dotnet/coreclr/issues/22302#issuecomment-519457167.

As @janvorli described, --cpus=1 does not actually limit your app to a single core. It still allows your app to run on all cores, but slower. We have made the decision to remove the clipping based on the documented API behavior.

We may need to introduce new configuration settings or new APIs to address this properly.

tmds commented 4 years ago

CPU quota is the standard parameter of configuring Kubernetes containers. Before this change the quota got reflected in ProcessorCount (maybe semantically not correct), but it means applications were taking it into account. With the change, the cpu quota are no longer available and no longer used by .NET Core/.NET Core apps. The common case of a many containers with low cpu quota on multi-multi-core machine puts your app in a weird configuration (e.g 0.7 quota with ProcessorCount 122).

We may need to introduce new configuration settings or new APIs to address this properly.

+1 The cpu quota should be available in some form.

mrmartan commented 4 years ago

That is not true. CPU quota is OS level construct and .NET process is limited by it as any other process.

There was a discussion to put different variable in https://github.com/dotnet/coreclr/issues/26053#issuecomment-520691729

mrmartan commented 4 years ago

The common case of a many containers with low cpu quota on multi-multi-core machine puts your app in a weird configuration (e.g 0.7 quota with ProcessorCount 122).

I can see that being an issue though

theolivenbaum commented 4 years ago

My main worry right now is memory allocation - as this seems to affect the number of heaps on Server GC (https://docs.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector#systemgcheapcountcomplus_gcheapcount).

It is a massive jump to go from 1 core to 112 cores, and while we can still configure this with the COMPlus_GCHeapCount flag, it's quite surprising that this would be needed at all. If this is going to be a permanent change, I think a breaking change announcement would be in nice to have, and at least some documentation on how to control the behavior, specially because on Kubernetes it can be tricky (or impossible - haven't fully understood their docs yet on this) to set the --cpusets option to get the former expected behavior.

Had to sadly roll back to netcore3.0 for now till we've some clarity here :(

danmoseley commented 4 years ago

@sergiy-k does your team own this issue? Keeping track of issues impacting upgrade to 3.1.

sergiy-k commented 4 years ago

I think it would good to introduce a new runtimeconfig option that would tell the runtime whether it should treat the quota as a limit on processor count or not. Without input from developer, the runtime has no way of knowing what the intended behavior is and, regardless of the implemented policy, there always will be 50% of apps what work great and 50% that do not. @janvorli, @jkotas, @VSadov, what do you think?

VSadov commented 4 years ago

Depending on a container or orchestration used, there is generally a way to limit processor count (as in parallelism level - in addition or independently from quota). That has effect on the app regardless whether it is managed or native.

I am not sure we want another layer of similar settings. That does allow additional scenarios where two layers of settings contradict each other, but is that desirable?

sergiy-k commented 4 years ago

@VSadov, you said "there is generally a way to limit processor count". What is it? I think that "--cpuset-cpus" is not really suitable for this purpose because it sets hard affinity and limits ability of the OS scheduler to schedule threads efficiently.

VSadov commented 4 years ago

Yes the underlying scheduler does not seem to allow restricting number of concurrently active threads without also constraining the set of cores to run on. I am not sure if that is on purpose or just a limitation that happens to be.

It is still an interesting question - if you have access to 128 cores and share the machine with 126 similar tenants, do you want 2 heaps and 2 TP threads? Or 126 of each, or maybe 50? That really depends on how the apps interleave their use of the machine.

If we go with a setting, perhaps it should be more general - like a degree of parallelism expected. It could be in cores (1-128) or in fraction of total available ( 50%, etc..)

That could be useful even on a regular machine - if you run two apps and know that one will use 1 core continuously, you may want to tell another app to expect N-1 cores usable at any given time.

tmds commented 4 years ago

I think that "--cpuset-cpus" is not really suitable for this purpose because it sets hard affinity and limits ability of the OS scheduler to schedule threads efficiently.

And Kubernetes exposes cpu quota as the knob to control the cpu allocation.

Yes the underlying scheduler does not seem to allow restricting number of concurrently active threads without also constraining the set of cores to run on

Just guessing, but the scheduler may take into account cpu quota when allocating threads to cores (favoring locality).

tmds commented 4 years ago

Can we revert this for .NET Core 3.1 and revisit it for .NET 5?

richlander commented 4 years ago

There are three questions I see:

What is the correct behavior?
Is this change breaking and should it be reverted?
Is there a workaround and what is it?

I'll try to answer them, in order:

I would say that CPU quota is a very nebulous concept. It is a synthetic concept that is about capacity, not about actual cores. I think the new behavior is more correct, because it is oriented on actual cores. If you set --cpus=7.0 on a 64 core machine, you will be executing your app on >7 cores. cpuset-cpus is the concept that very clearly limits you to a particular set of cores. The runtime does the right thing with that setting.
This is obviously a breaking change. I think if we don't take this change now, we never will. I think we should accept the break because it is the best behavior and we definitely want .NET Core to be a container-native runtime. My take is that this is the behavior a container-native runtime should have. I read some posts and this only convinced me that there is no one behavior that works, so we should just go with reporting facts and then build more policy on top (not at the bottom).
Yes, there are workarounds. I think there are a few (in order of preference, IMO): align memory limits and CPU quota to be coherent, opt-in to work station GC, set GC heaps to 0 with the flag referenced earlier.

Note: There are my answers. There is no suggestion that they are not up for debate.

It would be great if folks running into this issue can try setting memory limits and see if that helps. That would be great input.

Related, we have been planning a container density investigation. We were hoping to do it in December, but it is likely to happen in January. I wish we had completed it already, so we'd have more data and experience with the scenario. I'm trying my best to "see around the corner" in absence of having done that exercise. In particular, I'm wanting to guess what I think the product default should be after we compete that exercise.

I really like where we landed for the memory limits default. It is super cut and dry and defendable. I'm having a lot of trouble defining a default for CPU quotas. @jkotas asked me what I thought the behavior should be at --cpus=7.0. I said that the behavior we have today is clearly correct. I immediately felt that we should have a different behavior for <=1.0 CPUs, that we should default to workstation GC, for example. In fact, that is what @Maoni0 and I discussed when we worked on the memory limits proposal (relative to how many GC heaps are created). After a bit more thought, a special behavior for <=1.0 CPUs gets really ugly as soon as you pop over to >1.0 CPUs. It's not smooth at all. That's bad. The nice thing about our memory limits behavior is that it scales smoothly.

I also thought about our upcoming density investigation and what we'll value. I suspect we'll want something more like this chart. We're planning on running n instances of TechEmpower on a big (sharded) machine. We want to use as MUCH cpu as possible. That's what I think our default should align with.

The behavior being requested feels more like an acquiesce sort of behavior. That's totally rationale and makes sense. It's just not what I think we should align with as a default behavior. We need to decide what the best way to configure an app for that behavior should be, and how that aligns with K8 configuration options.

Fair?

I just updated an existing sample to include cgroup info @ https://github.com/richlander/testapps/blob/master/versioninfo/Program.cs#L28-L33

Here is what it does being compiled and run on 3.0 and 3.1. I elided irrelevant info.

Scenario 1: 3.0 SDK with 1 CPU set

C:\git\testapps\versioninfo>docker run --rm --cpus=1 -m 60mb -v %cd%:/app -w /app mcr.microsoft.com/dotnet/core/sdk:3.0 dotnet run
**.NET Core info**
Version: 3.0.1

**Environment info**
ProcessorCount: 1

**CGroup info**
cfs_quota_us: 100000
memory.limit_in_bytes: 62914560
memory.usage_in_bytes: 62877696

Scenario 2: 3.0 SDK with 1.5 CPU set

C:\git\testapps\versioninfo>docker run --rm --cpus=1.5 -m 60mb -v %cd%:/app -w /app mcr.microsoft.com/dotnet/core/sdk:3.0 dotnet run
**.NET Core info**
Version: 3.0.1

**Environment info**
ProcessorCount: 2

**CGroup info**
cfs_quota_us: 150000
memory.limit_in_bytes: 62914560
memory.usage_in_bytes: 62812160

Scenario 3: 3.1 SDK with 1 CPU set

C:\git\testapps\versioninfo>docker run --rm --cpus=1 -m 60mb -v %cd%:/app -w /app mcr.microsoft.com/dotnet/core/sdk:3.1 dotnet run
**.NET Core info**
Version: 3.1.0

**Environment info**
ProcessorCount: 2

**CGroup info**
cfs_quota_us: 100000
memory.limit_in_bytes: 62914560
memory.usage_in_bytes: 62873600

Scenario 4: 3.1 SDK with CPU affinity set (to 1 core)

C:\git\testapps\versioninfo>docker run --rm --cpuset-cpus=0 -m 60mb -v %cd%:/app -w /app mcr.microsoft.com/dotnet/core/sdk:3.1 dotnet run
**.NET Core info**
Version: 3.1.0

**Environment info**
ProcessorCount: 1

**CGroup info**
cfs_quota_us: -1
memory.limit_in_bytes: 62914560
memory.usage_in_bytes: 62844928

normj commented 4 years ago

To add some more data here. The AWS SDK for .NET has some caching logic for the number of pooled HttpClients we use based on the Environment.ProcessorCount property. This logic is left over from the .NET Core 1.0 performance analysis days and might not be necessary anymore but this change can cause a dramatic increase in the number of HttpClients used in a small container on a large machine. I preferred the old way as it gave us a generic rough estimate on the compute a library could use without having to special checks if I'm in docker, VM, real metal or Lambda.

richlander commented 4 years ago

Another idea is to add an opt-in experience along the lines of: <ConservativeProcessorCount>true</ConservativeProcessorCount>

Anyone think that's a good idea?

I'm still sticking to the idea that the current 3.1 behavior is the best default. Basically, someone is going to be unhappy either way, so I'd rather go with the most accurate behavior/value. Fair?

tmds commented 4 years ago

I think reverting this change should be considered because of:

backwards compatibility: the value returned can be of a completely different magnitude
usability: cpu quota is the knob Kubernetes gives you, and this is no longer available

The change was made not because there was a functional issue, but to improve performance. The benchmarking performed is insufficient to validate performance does not regress when ProcessorCount returns much higher values than before.

correctness

fwiw, cpu quota and effective available cores aren't completely orthogonal.

I think if we don't take this change now, we never will.

We can do this in .NET 5. And include the necessary additional properties/configuration flags/....

we have been planning a container density investigation.

:+1: :+1:

mrmartan commented 4 years ago

Another idea is to add an opt-in experience along the lines of: <ConservativeProcessorCount>true</ConservativeProcessorCount>

Such an option would give us an ability to switch between what is IMHO legacy netfx behavior (present in netcore <3.1) where CLR assumes it owns the machine, which was fine in dedicated server/VM era, and the new behavior that is more accurate/true and suitable for containers/shared machines.

Would it be a bad idea to reuse DOTNET_RUNNING_IN_CONTAINER for this? Another idea - could Environment.ProccessorCount be exposed as a knob to let developers override the assumptions made based on cpus, quotas or whatever.

There are many things tied to Environment.ProccessorCount. We do some thing based on the value reported. As definitely other customers do. CoreCLR itself as well. I'd agree with @tmds, that the default behavior should not be changed as it is done in 3.1 right now.

richlander commented 4 years ago

Another idea - could Environment.ProccessorCount be exposed as a knob to let developers override the assumptions made based on cpus, quotas or whatever.

Can you elaborate on this? Or how is this different than than the <ConservativeProcessorCount>true</ConservativeProcessorCount> suggestion?

tmds commented 4 years ago

Such an option would give us an ability to switch between what is IMHO legacy netfx behavior (present in netcore <3.1) where CLR assumes it owns the machine, which was fine in dedicated server/VM era, and the new behavior that is more accurate/true and suitable for containers/shared machines.

3.0 is taking into account cpu quota instead of using the number of cores of the physical machine. This makes .NET Core container aware. I don't understand why you call this 'legacy netfx' behavior.

Can you elaborate on this? Or how is this different than than the true suggestion?

@mrmartan is proposing to keep the default behavior as 3.0 and add a way to directly control the value returned by Environment.ProcessorCount (e.g. an envvar). It allows users to experiment, and to figure out the 'best' value for ProcessorCount.

I think that makes sense as a 3.1 config knob.

My main concern is about the changed default behavior.

richlander commented 4 years ago

I took a look at what golang does. I believe that this is their implementation @ https://github.com/golang/go/blob/8174f7fb2b64c221f7f80c9f7fd4d7eb317ac8bb/src/runtime/os_linux.go#L71

It seems be similarly policy-free as the new behavior in 3.1. Do I have that right?

My main concern is about the changed default behavior.

I get it. I'm super hesitant to just take the fix to return the old behavior w/o defining the best behavior. Obviously, we wouldn't be having this conversation if the 3.0 behavior was still in place. We'd only be talking about 5.0, as you are advocating. Now that the behavior has been changed, and for such an important scenario, it makes sense to think about what the right behavior is.

tmds commented 4 years ago

It seems be similarly policy-free as the new behavior in 3.1. Do I have that right?

This is an interesting read https://threadreaderapp.com/thread/1149654812595646471.html.

TL;DR Google and Uber use cgroup info to set GOMAXPROCS. And: you need to measure to know what makes sense.

I'm super hesitant to just take the fix to return the old behavior w/o defining the best behavior.

Do you feel confident the new behavior is better? (I don't)

janvorli commented 4 years ago

My view is that the current implementation returns the truth. If you are running on a device with 112 CPUs, then even with the quota, you can still be running on 112 different cores over time. So the environment really has 112 different CPUs, thus the Environment.ProcessorCount reflects that.
But for some cases, it seems it is also important for applications to be able to query the current quota. So I believe we should expose it on the Environment. It also seems we should think again about the number of GC heaps and threadpool parameters with relation to the CPU count and quota. I don't have a clear picture on how these should be related yet. Maybe the number of GC heaps should take the quota into account, but I am not sure.

tmds commented 4 years ago

We should look at how this value is/should be used, to decide what it best returns.

mrmartan commented 4 years ago

3.0 is taking into account cpu quota instead of using the number of cores of the physical machine. This makes .NET Core container aware.

That does not make it container aware. From my point of view it just lies to itself for its own benefit. There is a significant difference between the amount of CPU time a process is allowed to consume and the number of threads it is allowed to execute in parallel. .NET Core <3.1 ties those two things together. Moreover even with Environment.ProccessorCount == 1 multiple threads can be executing in parallel (as OS would still schedule them to multiple CPUs). Environment.ProccessorCount == 1 also forces a GC mode that is not overridable by any means (correct me if I am wrong), but I need to be able to use Server/Background GC even when CPU quota is <1500m (Environment.ProccessorCount == 1)

As I tried to explain in my original post (which more or less started all this), we are running a bunch of services that are running horizontally scaled (3-5) on K8s for the sake of resiliency/availability and none of them consume more than 1000m (put imprecisely , not even one whole CPU) but are highly parallel, bursty, I/O applications. We are forced by netcore 2.2/3 to run them with CPU quota set to 2000m just to keep Environment.ProccessorCount > 1. With Environment.ProccessorCount == 1 they grind to a halt under load (even though they would consume in some cases only 300m). Without CPU quota, thus Environment.ProccessorCount == 64 (in our case), they would consume ridiculous amounts of memory (due to heaps created) and I assume is the same behavior we would get with netcore 3.1. I guess we might be able to control this with knobs currently available but I don't want to be forced to configure and maintain all of this per application. I want CoreCLR to be smart about it, possibly with some hints from the developer.

brendandburns commented 4 years ago

There's an important difference here between CPU shares and limit. This is the same as request and limit in Kubernetes or --cpus and --cpu-limit in Docker.

--cpus/k8s request means "this is what I think I need" vs --cpu-limit/k8s limit this is where I should be capped/throttled.

I don't think dotnet should set processor count based on --cpus/k8s request. It is understood that a process can exceed this limit (at the system's potential peril) in order to drive up utilization.

When it comes to --cpu-limit/k8s limit it's a little more complicated. That value is the maximum number of CPU-microseconds that the scheduler will allow the process, but of course that doesn't say anything about maximum parllelism that is a good/performant idea.

My feeling is that in any high performance system, the degree of parallelism that works "well" is going to be highly dependent on the workload/code. But of course you probably want some heuristic value too.

So my recommendation would be: a) Have some default max_threads value in dotnet that libraries can use as the heuristic "best guess" value b) Set this max_threads based on some heuristic blend of --cpus and --cpu-limit c) Enable users to override this value via environment variable or flag if they want to see something different. d) Encourage library developers (e.g. HTTP Client) to also make their particular "HTTP_MAX_THREADS" or whatever configurable by the end user.

Basically, mixing # cores/cpu limit and max parallelism together is kind of a dangerous idea. It's ok for the 80% use-case but I think it will fall down at high scale.

tmds commented 4 years ago

With 3.0 ProcessorCount is (ab)used as the max_threads and the heuristic is to calculate it based on --cpus. I agree that ProcessorCount may not be semantically correct for this. And that this heuristic can be improved. The 3.1 change to return the #cores leaves us without any max_thread.

amrmahdi commented 4 years ago

I've setup a small app (high_cpu_repro.zip) to repro the high CPU consumption between 3.0 and 3.1.

Test Setup

Environment

Kubernetes v1.14 cluster running on azure
VM sku: Standard_DS13_v2 (8 cpus & 56 G of memory)

Limits

cpu: limits & requests set to 1 core
memory: limits & requests set to 1 G

Server

Basic greeter GRPC service using @davidfowl sample GRPC service + extra processing to simulate load
Server GC set to true

Client

Sends 20 concurrent unary GRPC messages of strings (16K per message)

Results

NET Core 3.0

Sever usage stabilizes at 80% of the 1 core

NET Core 3.1

Server consumes all available CPUs

This is just a simplified example. On on our actual service on production load when we deployed 3.1 to production, the cpu went from < 60% again to 100% (of 3 cores on a 8 core VM)

So at this point 3.1 is not really usable on kubernetes, and blocks us from consuming all improvements in 3.1.

I agree with @tmds that this is a breaking change that should be reverted for 3.1 until a better complete solution is designed and in place that makes NET Core container aware.

cc @davidfowl @brendandburns

jkotas commented 4 years ago

@amrmahdi What does the graph look like when you switch to workstation GC on .NET Core 3.1?

brendandburns commented 4 years ago

Your graph seems to show 3.1 consuming 1 core (Y axis is 1.0) which is expected since you set the CPU limit to 1 core. What happens if you set CPU limit to 0.8 for dotnet 3.1

Unless, your y-axis in the monitoring query is 1 == 100% of all CPUs, but in that case the 3.0 one is using 80% of all CPUs not 80% of one CPU.

jkotas commented 4 years ago

Your graph seems to show 3.1 consuming 1 core (Y axis is 1.0) which is expected since you set the CPU limit to 1 core.

Correct. The server GC (the default for ASP.NET apps) is optimized to maximize requests per second and minimize latency. It will try to use as much processor cycles and memory as it is allowed to. If you would like to optimize CPU cycles / request (at the cost of lower requests per second and higher latency), the workstation GC may be better option for you.

tmds commented 4 years ago

cpu: limits & requests set to 1 core

With 3.0 you get workstation GC with these limits. With 3.1 you get server GC.

VSadov commented 4 years ago

From the graph it looks like 3.0 had underutilization issues. The limit was 1 core and it used 0.8.

Did the throughput and latency of the app change between 3.0 and 3.1?

amrmahdi commented 4 years ago

@amrmahdi What does the graph look like when you switch to workstation GC on .NET Core 3.1?

It went up to 80% for a while then jumped to 100% and stayed there.

Your graph seems to show 3.1 consuming 1 core (Y axis is 1.0) which is expected since you set the CPU limit to 1 core. What happens if you set CPU limit to 0.8 for dotnet 3.1

Unless, your y-axis in the monitoring query is 1 == 100% of all CPUs, but in that case the 3.0 one is using 80% of all CPUs not 80% of one CPU.

This is a sample of the query I'm using

sum(rate(container_cpu_usage_seconds_total{namespace="netcoreperf", container_name="netcore31", pod="netcore31-6cdff86f94-lsx6g"}[5m]))

CORRECTION: The graph is not really showing percentage of the cpu limit. The y-axis is showing the fraction of cores used.

VSadov commented 4 years ago

If the same load was used for both scenarios (sending a burst of 20 messages), then the graphs seem to indicate that 3.1 dealt with requests 50% faster, while consuming 25% more CPU, which it was allowed to take by the config.

Since you want to spend only 0.8 cpus, what happens on 3.1 if you set 0.8 as a limit?

brendandburns commented 4 years ago

I don't think the y-axis is showing percentage of cores or fraction of cores (at least not based on that prometheus query), it is showing absolute # cores.

That prometheus query is showing sum of CPU seconds which should be total # of cores. Given that, it appears that dotnet 3.1 is performing as expected and using a full 1.0 CPU core (possibly spread across multiple physical cores)

tmds commented 4 years ago

I'd expect performance to be better with 3.1, especially if there is no contention on the server. The more interesting benchmark is to see what happens if you run a lot of containers simultaneously on a multi-multi-core server.

.NET Core 3.1 containers are not cpu quota aware:

gc: when no memory limit is set, the number of heaps will be equal to #cores
server gc is used instead of workstation gc, because ProcessorCount > 1
code using ProcessorCount as a measure for parallelism is off for low cpu quota

Looking at .NET Core 3.0 implementation, I guess assuming a bit more parallelism (especially for low cpu quota) will yield better performance. Going all the way to machine #cores may be overdoing it. Benchmarks should give the answer.

VSadov commented 4 years ago

ProcessorCount is generally used as a limiting factor. Scaling beyond that is often counterproductive, so it is used as hard/soft upperbound.

The actual parallelism of the app should be sensed from the app behavior. Admittedly, scaling to parallelism is not always as dynamic as it could be. There is room for improvement, but it is dynamic in many cases.

ThreadPool, for example will not immediately create ProcessorCount threads. It will set the initial limit to that value, the actual number of threads will scale depending on the load. And the limit will be adjusted further based on utilization feedback.

GC is less dynamic here. Many parts will scale with the load (allocation contexts are per-thread, for example, lots of feedback based behaviors, etc), but some parts are more static - i.e. the number of heaps. If number of heaps outnumbers running threads it is not a huge problem though. Heap balancing/stealing will deal with that and will be fairly cheap in this case.

If initial memory allocation is a concern, there are configs specifically for that.

Limiting parallelism artificially, based on cycle quotas works sometimes, but often for wrong reasons.

If there are hoarding issues, the things should be made more dynamic or have separate config knobs.

amrmahdi commented 4 years ago

If the same load was used for both scenarios (sending a burst of 20 messages), then the graphs seem to indicate that 3.1 dealt with requests 50% faster, while consuming 25% more CPU, which it was allowed to take by the config.

I'm not sure if this is entirely true. In my tests I observed higher throughout for 3.1. I even ran the same test with 3 cores per container

	test duration (seconds)	limit cpus	peak cpu	reqs	cpu per req
3	600	1	0.8	145907	0.00328977
3.1	600	1	1	87157	0.00411221

3	600	3	1.3	161400	0.00534587
3.1	600	3	1.7	155091	0.00699075

I don't see huge difference between server and workstation GC settings for both of the tests. I'm not sure what could have change between the 2 versions that affected this.

richlander commented 4 years ago

Just an update. Thanks for all this data/reports. It is super helpful. We conducted a quick investigation and are seeing similar results. We are starting a more formal investigation now.

Please feel free to share more data. We will look at it. However, our focus will now turn to more deep performance analysis. I hope to have more info to share soon (although the upcoming holidays may slow us down).

At this time, and with the information I have available, I would say that 3.0 is a better choice in CPU-limited containers. Please test 3.1 to ensure it is meeting your performance goals if you deploy it in production with CPU-limits.

splusq commented 4 years ago

Glad to see some agreement here. Upgrade to 3.1 directly impacted us.

richlander commented 4 years ago

Here are some relevant links for folks looking at this:

richlander commented 4 years ago

A small group of folks have been working on this problem and have been running tests all week to root cause the issue. We think we're now very close.

We have a new build to test: richlander/aspnet @ https://hub.docker.com/r/richlander/aspnet

This is the same content (modulo a fix) as: mcr.microsoft.com/dotnet/core/aspnet:3.1.0-buster-slim

I tested that it works functionally. I built the container image in an ad hoc way (not through our regular infrastructure), so it's possible there are bugs in that. Please report any strange behavior you see.

Here's the short version of what to expect:

Environment.ProcessorCount is still the same as 3.1, reporting processor count for the whole machine.
The CoreCLR thread pool min threads value is set to the same as the CPU quota.
epoll threads have been reduced to 1.

PR for change: https://github.com/dotnet/coreclr/pull/27975

@VSadov will correct / expand on the details.

tmds commented 4 years ago

Let's consider 3 aspects which are important: correctness, performance, and backwards-compatibility.

This change is not backwards compatible.

Performance has regressed, which is why additional changes are being proposed in .NET Core. Code outside .NET Core is affected also. For example, the ASP.NET Libuv transport uses ProcessorCount in a similar way as it affected the epoll threads and threadpool. For the additional change that is now being proposed for the nr of epoll threads. I've made a PR to set this to 1 previously, but it was not merged because performance was worse than at ProcessorCount.

Overall we don't know if performance has now been improved/regressed. The main driver for pushing the change seems correctness. Imo performance and backwards-compatibility are much more important than correctness.

dotnet / runtime