ERL-927: CGroup cpu shares not taken into account when # of schedulers is chosen

OTP-Maintainer commented 5 years ago

Original reporter: tsloughter
Affected version: Not Specified
Fixed in version: OTP-23
Component: erts
Migrated from: https://bugs.erlang.org/browse/ERL-927

When using CFS quota and period to limit a container's CPU usage, such as with `docker run --cpus 1 ...` as of Docker 1.13 (https://blog.docker.com/2017/01/cpu-management-docker-1-13/), the limits are not taken into account when creating schedulers.

When using something like `--cpuset-cpus 0,1` to make only CPUs 0 and 1 available, the Erlang node does properly limit itself to 2 online schedulers.

Assuming `--cpus` does limit the number of parallel tasks, the VM should also limit the number of online schedulers. Based on the Red Hat documentation, https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu#sect-cfs:

bq. Note that the quota and period parameters operate on a CPU basis. To allow a process to fully utilize two CPUs, for example, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 100000. 

It sounds like the container is limited to the use of that many individual CPUs, depending on the values, and thus additional schedulers would not be running in parallel.

My proposal is based on the related Java ticket and patch https://bugs.openjdk.java.net/browse/JDK-8146115:

bq. Use a combination of number_of_cpus() and cpu_sets() in order to determine how many processors are available to the process and adjust the JVMs os::active_processor_count appropriately. The number_of_cpus() will be calculated based on the cpu_quota() and cpu_period() using this formula: number_of_cpus() = cpu_quota() / cpu_period(). If cpu_shares has been setup for the container, the number_of_cpus() will be calculated based on cpu_shares()/1024. 1024 is the default and standard unit for calculating relative cpu usage in cloud based container management software. 

So my proposal is to use the ceiling of cpu_quota() / cpu_period() as the number of active schedulers to run by default.
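For illustration, the proposed calculation can be sketched in a few lines of Erlang. This is a minimal sketch, not how ERTS computes anything: the module and function names are hypothetical, and both the cap at the number of logical CPUs and the treatment of a non-positive quota as "no limit" are assumptions made for the example.

```erlang
-module(cfs_sched_sketch).
-export([from_quota/3]).

%% Hypothetical helper: ceiling of Quota / Period, capped at Cpus.
%% Quota and Period are in microseconds, as in cpu.cfs_quota_us and
%% cpu.cfs_period_us. A non-positive quota is treated as "no limit"
%% (assumption for this sketch).
from_quota(Quota, Period, Cpus)
  when is_integer(Quota), is_integer(Period), Quota > 0, Period > 0 ->
    %% Integer ceiling division for positive integers.
    min(Cpus, (Quota + Period - 1) div Period);
from_quota(_Quota, _Period, Cpus) ->
    Cpus.
```

With the values from the Red Hat note (a quota of 200000 and a period of 100000), `cfs_sched_sketch:from_quota(200000, 100000, 8)` evaluates to 2, and a fractional limit such as 150000/100000 is also rounded up to 2.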

OTP-Maintainer commented 5 years ago

tsloughter said:

Related kernel docs, which include examples showing how the limits restrict how many CPUs are available: https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt

OTP-Maintainer commented 5 years ago

lukas said:

{quote}Limit a group to 1 CPU worth of runtime.{quote}

They talk a lot about 1 CPU worth of runtime in that documentation. To me that seems like you still get N CPUs for 1/N time units when you run.

If I start {{docker run -it --rm --cpus 4 erlang}} and then do this:

{code}
1> [spawn(fun F() -> lists:seq(1,19000), F() end) || _ <- lists:seq(1,10)].
{code}

The CPU utilization of my system ends up at 50% for each core; it does not come to 100% for 4 cores and 0% for the others. If I start just 1 process instead of 10, then I get 1 CPU at 100%.

This would suggest to me that the CFS does give the system access to 8 cores, even if only 4 CPUs worth of runtime is available. So in a system that uses 50% of the allowed CPU resources, it would be possible to get 8 parallel threads doing work.

OTP-Maintainer commented 5 years ago

tsloughter said:

Hm, damn, true.

OTP-Maintainer commented 5 years ago

tsloughter said:

So it may just be a matter of documentation, and it may not be a good idea to limit schedulers based on this. I would think an option to automatically limit schedulers based on the quota/period would be useful for the times people prefer to optimize for throughput over latency. But based on my reading of the code and the Java patch, it doesn't appear that simple to add, in which case it likely isn't worth it when a user can simply set the number of schedulers based on the limits they are setting for their container.

Unless you think the scheduler option is worth doing, I'll mark this as resolved.
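As a sketch of the manual route, the number of online schedulers can be lowered at runtime with `erlang:system_flag(schedulers_online, N)`. The module and function names below are hypothetical, and the limit passed in is assumed to be whatever value was given to `--cpus`.

```erlang
-module(sched_limit_sketch).
-export([limit_online/1]).

%% Hypothetical helper: cap the number of online schedulers at Limit.
%% The value can never exceed the number of schedulers the VM was
%% started with, so the request is clamped to that first.
limit_online(Limit) when is_integer(Limit), Limit >= 1 ->
    Max = erlang:system_info(schedulers),
    _Previous = erlang:system_flag(schedulers_online, min(Limit, Max)),
    %% Return the value now in effect.
    erlang:system_info(schedulers_online).
```

The same effect should be available at startup through the `+S` emulator flag, for example `erl +S 4` in a container started with `--cpus 4`.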

OTP-Maintainer commented 5 years ago

tsloughter said:

I guess a more complete verification would also involve multiple processes with limits set. On an 8-core system with 2 nodes, each with a limit of `--cpus 4`, how are they spread across CPUs and scheduled? Does it switch between them across all CPUs, or does it end up scheduling each on half of the cores?

OTP-Maintainer commented 5 years ago

tsloughter said:

Another issue is throttling. With 8 schedulers and a quota of 4 "cpus", I think you are more likely to reach the quota before the next CFS period, resulting in throttling. I'm still a little hazy on this aspect.
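A back-of-the-envelope sketch of that concern, under an idealized model: every thread is assumed to be continuously busy on its own core, and the kernel's finer-grained runtime accounting is ignored. The module and function names are hypothetical.

```erlang
-module(cfs_throttle_sketch).
-export([busy_time_us/3]).

%% Hypothetical helper: wall-clock microseconds into each CFS period
%% before the group's quota is used up, when Threads threads all run
%% flat out in parallel. Returns PeriodUs when the quota is not
%% exhausted within the period, i.e. no throttling in this model.
busy_time_us(QuotaUs, PeriodUs, Threads)
  when QuotaUs > 0, PeriodUs > 0, Threads > 0 ->
    min(PeriodUs, QuotaUs div Threads).
```

With `--cpus 4` expressed as a quota of 400000 over a period of 100000, `cfs_throttle_sketch:busy_time_us(400000, 100000, 8)` gives 50000: eight busy schedulers would use up the quota halfway through each period and sit throttled for the rest, which is consistent with the 50% per-core utilization observed above. With 4 threads the result is 100000, so the quota lasts the whole period.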

OTP-Maintainer commented 5 years ago

tsloughter said:

Has there been any more internal discussion on this on the team?

Are numbers needed to show that a scheduler per core when quota is restricted leads to CFS throttling for this change to be considered?

OTP-Maintainer commented 5 years ago

lukas said:

{quote}Has there been any more internal discussion on this on the team?{quote}

No, not really.

{quote}Are numbers needed to show that a scheduler per core when quota is restricted leads to CFS throttling for this change to be considered?{quote}

I've been going back and forth on what would be best here, but I've ended up thinking that it would be a good idea to restrict the number of online schedulers based on the CFS quotas. It is not obvious that it is the optimal choice, but it is what the user expects when using Docker, and that is the most common use case when the quotas are used.

OTP-Maintainer commented 5 years ago

tsloughter said:

Ok, great. If I can help in any way just let me know.

OTP-Maintainer commented 5 years ago

john said:

First stab at fixing this: https://github.com/jhogberg/otp/commits/john/erts/container-tweaking/OTP-16105/ERL-927

I've decided to ignore {{cpu.shares}}; if I'm reading the docs right it's a weight saying how much CPU time we get relative to other processes when the system is constrained, so limiting the number of schedulers based on that doesn't feel right.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu#sect-cfs

OTP-Maintainer commented 5 years ago

john said:

I've merged the changes into {{master}}, thanks for bringing this up!

OTP-Maintainer commented 5 years ago

tsloughter said:

Wooo! :)
