Runtime.availableProcessors() ignores cgroup limits

thjaeckle commented 6 years ago

When running OpenJ9/OpenJDK8 in Docker (e.g. via an image from adoptopenjdk), a call to Runtime.availableProcessors() returns the number of processors of the Docker host system and not the limit defined in the cgroup file /sys/fs/cgroup/cpu/cpu.cfs_quota_us.

Hotspot has/had the same problem: https://bugs.openjdk.java.net/browse/JDK-6515172 which they fixed in 9 and apparently also backported to 8.

For Java 10 they add even better Docker support: https://bugs.openjdk.java.net/browse/JDK-8146115

As OpenJ9 really shines in cloud environments a better "out of the box" Docker container resource usage would be awesome.

pshipton commented 6 years ago

@ashu-mehra @dinogun fyi

ashu-mehra commented 6 years ago

In addition to Runtime.availableProcessors(), it appears Hotspot has also updated the JVMTI API GetAvailableProcessors() to return the same value as Runtime.availableProcessors(). I created a docker image using build 10-ea+40 and tested the output of Runtime.availableProcessors() and GetAvailableProcessors():

1) Using cpusets:

# docker run --rm --cpuset-cpus=0-3 --volume=/home/ashu/jvmtiTest:/jvmtiTest --entrypoint=/java/jdk-10/bin/java openjdk:jdk10_40 -agentpath:/jvmtiTest/libagent.so -cp /jvmtiTest ProcessorTest
jvmtiGetAvailableProcessors returned: 4
Runtime.availableProcessors: 4

2) Using cpu quota:

# docker run --rm --cpuset-cpus=0-3 --cpus=2 --volume=/home/ashu/jvmtiTest:/jvmtiTest --entrypoint=/java/jdk-10/bin/java openjdk:jdk10_40 -agentpath:/jvmtiTest/libagent.so -cp /jvmtiTest ProcessorTest
jvmtiGetAvailableProcessors returned: 2
Runtime.availableProcessors: 2
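The source of ProcessorTest is not included in this thread. As a rough sketch of what the Java side of such a check could look like (the class name `ProcessorCheck` and the helper `count()` are hypothetical, and the JVMTI agent half is omitted):

```java
// Hypothetical stand-in for the Java half of the test above; the real
// ProcessorTest source is not shown in this thread.
class ProcessorCheck {
    // Returns the processor count as seen by the Java API.
    static int count() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Runtime.availableProcessors: " + count());
    }
}
```

Running this inside and outside a container with `--cpus` or `--cpuset-cpus` set is enough to see whether the JVM honours the limit.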

DanHeidinga commented 6 years ago

OpenJDK has also added a -XX:ActiveProcessorCount=count option that enables setting the number of processors returned by Runtime.availableProcessors(). We should support the option as well.

ashu-mehra commented 6 years ago

Quoting the statement from this bug report regarding the -XX:ActiveProcessorCount=count option:

-XX:ActiveProcessorCount=count option allows a user to override the number of processors the VM will use when creating threads for various subsystems. This option is available on all currently supported operating systems.

babsingh commented 6 years ago

fyi - @Ali-Ni. Study this task/issue and investigate the VM changes required.

babsingh commented 6 years ago

fyi - @tajila

anikser commented 6 years ago

With regards to the cgroup processor limit issue:

We need to consider that cgroup limiting of cpu.cfs_quota_us and cpu.cfs_period_us (for example through docker's --cpus=<value> option) has less to do with the actual number of physical processors available, but rather configures the CFS scheduler to impose a limit on the fraction of time the processors spend on all the tasks in the cgroup before throttling. So, setting --cpus=1.5 is a valid option. Currently, Hotspot's Runtime.availableProcessors() function seems to round this "equivalent number of CPUs" to the nearest integer (based on testing with 10+46 nightly build).
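To make the arithmetic concrete, here is a minimal sketch of turning the two CFS values into an "equivalent number of CPUs". The class and method names are hypothetical, and rounding up is an assumption made for this sketch (the comment above only observes that Hotspot appears to round to the nearest integer). It also assumes that `--cpus=1.5` corresponds to a quota of 150000 with a period of 100000, and that a quota of -1 means no limit.

```java
// Hypothetical helper: converts CFS quota/period into a processor count.
class CfsQuota {
    // Fractional CPUs allowed by the quota, or -1.0 when no quota is set.
    static double equivalentCpus(long quotaUs, long periodUs) {
        if (quotaUs <= 0 || periodUs <= 0) {
            return -1.0; // cpu.cfs_quota_us is -1 when unlimited
        }
        return (double) quotaUs / (double) periodUs;
    }

    // Integer processor count, capped by the online CPUs and floored at 1.
    // Rounding up is an assumption of this sketch, not confirmed behaviour.
    static int processorCount(long quotaUs, long periodUs, int onlineCpus) {
        double eq = equivalentCpus(quotaUs, periodUs);
        if (eq < 0) {
            return onlineCpus;
        }
        int rounded = (int) Math.ceil(eq);
        return Math.max(1, Math.min(onlineCpus, rounded));
    }

    public static void main(String[] args) {
        System.out.println(processorCount(150000, 100000, 8));
    }
}
```

A fractional quota such as 1.5 has no exact integer equivalent, which is why the rounding policy has to be chosen and documented explicitly.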

The cpuset.cpus parameter, controlled in docker by --cpuset-cpus, specifies the physical CPUs that the tasks in the cgroup can actually run on, and so represents the physical number of CPUs available. This is already handled by OpenJ9, and Runtime.availableProcessors() behaves as expected, the same as Hotspot (10+46 nightly).
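For illustration only, the list format used by cpuset.cpus (for example `0-3` or `0-3,8`) can be counted as below. This is not how OpenJ9 obtains the value; the class name is hypothetical and the sketch assumes the ranges do not overlap.

```java
// Hypothetical parser for the cpuset list format, e.g. "0-3,8".
class CpusetParser {
    // Counts the CPUs named in a cpuset list; assumes non-overlapping ranges.
    static int countCpus(String cpuset) {
        String s = cpuset.trim();
        if (s.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (String part : s.split(",")) {
            String p = part.trim();
            int dash = p.indexOf('-');
            if (dash < 0) {
                Integer.parseInt(p); // validates that this is a single CPU id
                count += 1;
            } else {
                int lo = Integer.parseInt(p.substring(0, dash));
                int hi = Integer.parseInt(p.substring(dash + 1));
                if (hi < lo) {
                    throw new IllegalArgumentException("bad range: " + p);
                }
                count += hi - lo + 1;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countCpus("0-3"));
    }
}
```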

With this in mind: We can update omrsysinfo_get_number_CPUs_by_type in the OMR Port Library, changing the OMRPORT_CPU_ONLINE (and maybe OMRPORT_CPU_BOUND?) flag to be capped at the equivalent number of CPUs that the cgroup cpu subsystem limits the process to. Alternatively, another ‘CPU type’ flag could be created to satisfy this definition.

Another option is to write a whole new port library function in omrsysinfo.c that checks the cgroup cpu subsystem limit. Both a new flag for the existing function and a new function can be intentionally used wherever needed without affecting any other part of the VM.

tajila commented 6 years ago

Alternatively, another ‘CPU type’ flag could be created to satisfy this definition.

I am in favour of this approach

DanHeidinga commented 6 years ago

Which functions would be updated to read the new CPU type? What would the type be and how would we document when to use it vs online / bound /etc?

anikser commented 6 years ago

The type would be of an "equivalent number of CPUs", and could be labelled OMRPORT_CPU_EQUIVALENT (also add J9PORT_CPU_EQUIVALENT in openj9). It would be documented as something along the lines of "Number of equivalent CPUs that the process is limited to based on CFS scheduling constraints." We should take care to indicate that it is not representative of physical CPUs, and should consider detailing where this value comes from (cpu.cfs_quota_us / cpu.cfs_period_us) to avoid confusion.

jvmtiGetAvailableProcessors() is the only function that would need to be updated, as far as I can tell (called by Runtime.availableProcessors()) - currently it calls j9sysinfo_get_number_CPUs_by_type(J9PORT_CPU_ONLINE)

anikser commented 6 years ago

Alternatively OMRPORT_CPU_EQUIVALENT_CFS

tajila commented 6 years ago

"Number of equivalent CPUs that the process is limited to based on CFS scheduling constraints."

What will this option report when the JVM is not running in a docker environment?

anikser commented 6 years ago

If this option is being defined entirely based on cgroup CFS scheduling constraints, we should return the equivalent number of CPUs based on the cgroup cpu subsystem parameters, regardless of whether or not it is fully containerized (i.e. in Docker). If this is done, however, manually running the JVM in a cgroup with these parameters defined would similarly affect what is returned.

If it is running in an environment that does not support cgroups, an error can be thrown, or it could return omrsysinfo_get_number_CPUs_by_type(OMRPORT_CPU_ONLINE). If we decide that an error should be thrown, then cgroup support would have to be handled in jvmtiGetAvailableProcessors().
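The fallback variant could be sketched as follows. This is an illustration in Java rather than port library code; the class and method names are hypothetical, the directory is passed in rather than hard-coded, and rounding the quota up is an assumption. Any missing or unreadable file is treated as "cgroups not supported" and the caller-supplied fallback (standing in for the online CPU count) is returned.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of "use the cgroup quota if present, else fall back".
class CgroupCpuLimit {
    private static long readLong(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    // cgroupCpuDir is the cpu controller directory, e.g. /sys/fs/cgroup/cpu.
    static int cpuLimitOrFallback(Path cgroupCpuDir, int fallback) {
        try {
            long quota = readLong(cgroupCpuDir.resolve("cpu.cfs_quota_us"));
            long period = readLong(cgroupCpuDir.resolve("cpu.cfs_period_us"));
            if (quota <= 0 || period <= 0) {
                return fallback; // no quota configured
            }
            int limit = (int) Math.ceil((double) quota / (double) period);
            return Math.max(1, Math.min(fallback, limit));
        } catch (IOException | NumberFormatException e) {
            return fallback; // cgroup files missing or unreadable
        }
    }

    public static void main(String[] args) {
        int online = Runtime.getRuntime().availableProcessors();
        System.out.println(cpuLimitOrFallback(Paths.get("/sys/fs/cgroup/cpu"), online));
    }
}
```

Returning the fallback instead of throwing keeps callers free of cgroup-specific error handling.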

tajila commented 6 years ago

If it is running in an environment that does not support cgroups, an error can be thrown, or it could return omrsysinfo_get_number_CPUs_by_type(OMRPORT_CPU_ONLINE).

I think providing a reasonable fallback would be more ideal than returning an error.

Currently, Runtime.availableProcessors() returns j9sysinfo_get_number_CPUs_by_type(J9PORT_CPU_TARGET), as long as it's >= 1. This seems more correct than what jvmtiGetAvailableProcessors currently does.

According to the docs, OMRPORT_CPU_TARGET is OMR_MIN(BOUND, ENTITLED).

ENTITLED is the number of CPUs that the user has specified. This is probably what -XX:ActiveProcessorCount=count will set.

BOUND is the number of physical CPUs bound to the process.

One approach is to create a "container aware" version of OMRPORT_CPU_BOUND, perhaps OMRPORT_VIRTUALIZATION_CPU_BOUND. This basically behaves like OMRPORT_CPU_BOUND except it takes resource limits like cpu.cfs_quota_us into account as well. And, this can fall back to OMRPORT_CPU_BOUND if the current JVM instance is not running in a virtual environment (docker, cgroups, etc.). Or we can make OMRPORT_CPU_BOUND container aware by default, but then we have to double check that it doesn't break anything.

Also, is docker numa aware? What happens if options like --cpunodebind or --physcpubind are used in combination with docker options --cpuset-cpus= and --cpus=?

ashu-mehra commented 6 years ago

I think changes for Runtime.availableProcessors() and jvmtiGetAvailableProcessors should be protected by -XX:[+-]UseContainerSupport option. Behavior in hotspot is also based on this option, which is enabled by default. For OpenJ9 we have not yet enabled it by default, but we should do that once things are deemed stable.

ashu-mehra commented 6 years ago

we can make OMRPORT_CPU_BOUND container aware by default, but then we have to double check that it doesn't break anything.

This will certainly change the number of GC threads. I have run some benchmarks to understand the impact of changing GC threads but the results are not conclusive enough. I am checking with Dmitri and Aleks to decide if we want to tune GC threads or not. The number of JIT compilation threads is also something which can be tuned based on cpu quota, and as per my discussion with @mpirvu it is okay to tune that. Currently, on Linux, the JIT always creates 7 threads.

mpirvu commented 6 years ago

It should be noted that while the JIT always creates 7 compilation threads they may not all be used. On Linux a non-root user cannot raise the priority of a thread. Hence, it is possible that compilation threads are starved by the presence of many application threads. The JIT has some logic to detect starvation and then it will launch additional compilation threads as needed.

tajila commented 6 years ago

Thanks @ashu-mehra, I wasn't aware of the -XX:[+-]UseContainerSupport option.

Is it safe to assume that if container support is disabled (i.e. -XX:-UseContainerSupport) then hotspot's behaviour is equivalent to what we are currently doing? In other words, the real issue here is that we don't yet support this option?

ashu-mehra commented 6 years ago

Is it safe to assume that if container support is disabled (ie. -XX:-UseContainerSupport) then hotspot's behaviour is equivalent to what we are currently doing?

Right, using -XX:-UseContainerSupport disables container awareness in Hotspot. This option influences not just Runtime.availableProcessors() but also a few other internal JVM parameters like GC and JIT thread counts and heap size (and probably some others). See this for more details.

We have added this option in OpenJ9 but it is only being used for enabling container awareness with respect to memory limits. Relevant PRs:

ashu-mehra commented 6 years ago

It should be noted that while the JIT always creates 7 compilation threads they may not all be used. On Linux a non-root user cannot raise the priority of a thread. Hence, it is possible that compilation threads are starved by the presence of many application threads. The JIT has some logic to detect starvation and then it will launch additional compilation threads as needed.

So it appears the JIT makes a new compilation thread active only when there is a requirement, which depends on multiple factors such as the number of active threads, compilation backlog, and compilation thread starvation. Given that the decision to create 7 threads on Linux is currently taken irrespective of the number of cpus available, I am now wondering why we should change it when using cpu quota. The existing approach should work as it is, since it is agnostic to available cpus. When there is no starvation, we do not activate more threads than (getNumTargetCPUs() - 1), which translates to OMRPORT_CPU_TARGET - 1. So, if we just ensure getNumTargetCPUs() returns a value based on cpu quota, then I think the existing mechanism should work fine. @mpirvu what do you think?
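As a simplified model of the cap described above (not the actual JIT code): the class and method names are hypothetical, the floor of one active thread is an assumption, and starvation is modelled crudely by allowing every created thread to become active.

```java
// Hypothetical model of the activation cap; not taken from the JIT source.
class CompThreadCap {
    // createdThreads: threads created up front (7 on Linux per this thread).
    // targetCpus: value that getNumTargetCPUs() would return.
    static int maxActive(int createdThreads, int targetCpus, boolean starved) {
        if (starved) {
            return createdThreads; // upper bound; real logic adds threads as needed
        }
        int cap = Math.max(1, targetCpus - 1); // floor of 1 is an assumption
        return Math.min(createdThreads, cap);
    }

    public static void main(String[] args) {
        System.out.println(maxActive(7, 2, false));
    }
}
```

Under this model a quota of 2 CPUs would leave one active compilation thread unless starvation is detected, without changing how many threads are created.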

mpirvu commented 6 years ago

I agree that all we need to do is to have getNumTargetCPUs take the CPU quota into consideration.

anikser commented 6 years ago

Just a note: In Hotspot, Runtime.availableProcessors() does take into account cgroup CPU quota when limited by a user defined cgroup outside of a container. If -XX:-UseContainerSupport is used however, this behaviour is disabled.

anikser commented 6 years ago

A solution could be to change OMRPORT_CPU_BOUND to check if the cgroup CPU subsystem is enabled, which indicates that -XX:+UseContainerSupport is set (similar to what is done in sysinfo_get_physical_memory). If it is, then OMRPORT_CPU_BOUND would return the minimum between what it currently returns and the CPU quota.
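The proposed rule can be sketched as a small pure function. The names are hypothetical, and treating a non-positive quota value as "no quota set" is an assumption of this sketch.

```java
// Hypothetical sketch of a container-aware BOUND value.
class ContainerAwareBound {
    // affinityBound: what OMRPORT_CPU_BOUND returns today.
    // quotaCpus: CPU count derived from the cgroup quota, <= 0 if none.
    // containerSupport: whether -XX:+UseContainerSupport is in effect.
    static int boundCpus(int affinityBound, int quotaCpus, boolean containerSupport) {
        if (!containerSupport || quotaCpus <= 0) {
            return affinityBound; // unchanged behaviour
        }
        return Math.min(affinityBound, quotaCpus);
    }

    public static void main(String[] args) {
        // Mirrors --cpuset-cpus=0-3 --cpus=2 from the earlier test run.
        System.out.println(boundCpus(4, 2, true));
    }
}
```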

In Hotspot, -XX:ActiveProcessorCount=count overrides all other forms of number of CPU detection, and ignores all resource limits, including cgroup cpuset and cpu quota, and even the physical number of cpus available:

root@355581430b0a:~# jdk-10+46/bin/java -XX:ActiveProcessorCount=100000 test
[4.589s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 4k, detached.
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create worker GC thread. Out of system resources.
# An error report file with more information is saved as:
# /root/hs_err_pid1075.log

To implement this, -Xentitledcpus could be removed and replaced with this new flag. OMRPORT_CPU_TARGET could then be changed to return OMRPORT_CPU_ENTITLED if it is set, OMRPORT_CPU_BOUND if it isn't.
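The difference between the documented OMR_MIN(BOUND, ENTITLED) rule and the override proposed here can be shown side by side. The names are hypothetical, and representing "ENTITLED not set" as 0 is an assumption of this sketch.

```java
// Hypothetical comparison of the two TARGET policies discussed in this thread.
class CpuTarget {
    // Documented rule: TARGET is the minimum of BOUND and ENTITLED.
    static int currentTarget(int bound, int entitled) {
        return entitled > 0 ? Math.min(bound, entitled) : bound;
    }

    // Proposed rule: a user-set ENTITLED overrides BOUND entirely.
    static int proposedTarget(int bound, int entitled) {
        return entitled > 0 ? entitled : bound;
    }

    public static void main(String[] args) {
        System.out.println(currentTarget(4, 100000));
        System.out.println(proposedTarget(4, 100000));
    }
}
```

With the override rule an oversized user value is passed through unchanged, which is the behaviour seen in the Hotspot run with -XX:ActiveProcessorCount=100000.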

Upon closer inspection, there are many cases in the JVM where J9PORT_CPU_ONLINE is used to determine the number of available CPUs. We may want to consider reevaluating these cases and deciding if CPU quota and/or number of entitled CPUs set by the user should be taken into consideration.

anikser commented 6 years ago

@dmitripivkine can you look at my previous comment and give your thoughts on oracle's implementation of -XX:ActiveProcessorCount? Do we want to copy this behavior?

DanHeidinga commented 6 years ago

@Ali-Ni Is this work complete?

anikser commented 6 years ago

Cgroup cpu quota checking and -XX:ActiveProcessorCount have been merged. I think other future container awareness improvements can be delegated to other issues.

DanHeidinga commented 5 years ago

This has gone stale. Closing

eclipse-openj9 / openj9

Runtime.availableProcessors() ignores cgroup limits #1166