eclipse-omr / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr

Container and cgroupv2 recognition failing OOMKiller terminating pod #6935

Open jdekonin opened 1 year ago

jdekonin commented 1 year ago

While investigating a customer problem in which the OOMKiller terminates a pod, we noticed that neither the container environment nor the cgroup v2 limits are detected as expected. The container is launched in an AKS 1.25.5 environment with a memory constraint of 512M, using the Liberty container websphere-liberty:22.0.0.13-kernel-java11-openj9 version 22.0.0.9, which contains Semeru OE 11.0.17. I believe this maps to OMR sha 90a1bade, which was OpenJ9 tag/release openj9-0.35.0.

From a javacore:

1CICONTINFO    Running in container : FALSE
1CICGRPINFO    JVM support for cgroups enabled : FALSE
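For anyone collecting the same data from several pods, the two javacore lines above can be pulled out with a small script. This is only a sketch: the tags and line layout are taken from the excerpt above, while the function name and the returned keys are made up for illustration.

```python
# Tags copied from the javacore excerpt; the key names are hypothetical.
TAGS = {
    "1CICONTINFO": "running_in_container",
    "1CICGRPINFO": "cgroup_support_enabled",
}


def parse_container_flags(javacore_text):
    """Return booleans for the container-related javacore lines that are present."""
    flags = {}
    for line in javacore_text.splitlines():
        parts = line.split(None, 1)  # tag, then the rest of the line
        if len(parts) != 2 or parts[0] not in TAGS:
            continue
        _label, sep, value = parts[1].rpartition(":")
        if not sep:
            continue  # a line with this tag but no "label : value" pair
        flags[TAGS[parts[0]]] = value.strip().upper() == "TRUE"
    return flags


if __name__ == "__main__":
    sample = (
        "1CICONTINFO    Running in container : FALSE\n"
        "1CICGRPINFO    JVM support for cgroups enabled : FALSE\n"
    )
    print(parse_container_flags(sample))
```

A key missing from the result means the corresponding line was not found in the javacore, which is different from the line being present with FALSE.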

Further debugging in the failing environment shows that /sys/fs/cgroup/memory.max contains the expected value and that stat -c %T -f /sys/fs/cgroup returns the expected value of cgroup2fs, yet the JVM starts with a max heap of 30+GB. The files /.dockerenv and /run/.containerenv do not exist in the environment, so container recognition that relies on them is not sufficient in this use case. Using the option -XX:+/-UseContainerSupport would seem appropriate, but it doesn't appear to have any impact. A web search [1] suggests that AKS supports three different container runtimes: docker, CRI-O, and containerd, with containerd the default since 1.19. A possible workaround [2] is to create a TESTCONTAINERS_HOST_OVERRIDE environment variable, but this has not been confirmed as working.

$ stat -c %T -f /sys/fs/cgroup
cgroup2fs
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
$ cat /sys/fs/cgroup/memory.max
536870912
$ ls -l /.dockerenv
ls: cannot access '/.dockerenv': No such file or directory
$ ls -l /run/.containerenv
ls: cannot access '/run/.containerenv': No such file or directory
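The manual checks above can be gathered into one script so that the same evidence is collected from every pod. The sketch below is a diagnostic aid, not OMR's detection logic. It uses the presence of /sys/fs/cgroup/cgroup.controllers as a stand-in for the stat filesystem-type check, and the function name and the `root` parameter (which only exists to make the function testable against a fake directory tree) are assumptions.

```python
from pathlib import Path


def collect_container_evidence(root="/"):
    """Gather the same facts as the shell commands above, relative to `root`."""
    root = Path(root)
    cgroup_dir = root / "sys/fs/cgroup"
    controllers = cgroup_dir / "cgroup.controllers"
    memory_max = cgroup_dir / "memory.max"

    evidence = {
        # cgroup.controllers is present when /sys/fs/cgroup is a cgroup v2 mount.
        "cgroup_v2": controllers.is_file(),
        "controllers": controllers.read_text().split() if controllers.is_file() else [],
        "memory_max": None,  # None means "no limit found"
        "dockerenv": (root / ".dockerenv").exists(),
        "containerenv": (root / "run/.containerenv").exists(),
    }
    if memory_max.is_file():
        raw = memory_max.read_text().strip()
        # cgroup v2 writes the literal string "max" when no limit is set.
        evidence["memory_max"] = None if raw == "max" else int(raw)
    return evidence


if __name__ == "__main__":
    for key, value in collect_container_evidence().items():
        print(f"{key}: {value}")
```

In the failing environment described here, this would be expected to report cgroup_v2 as True, memory_max as 536870912, and both marker files as absent.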

An attempt to recreate the problem with the same container “failed”: the cgroup v2 settings were detected properly, but container recognition still failed. This leaves open the question of how cgroup v2 is detected and what conditions are required for detection to succeed.
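One input worth comparing between the failing and the working environment is /proc/self/cgroup. Per cgroups(7), each line has the form hierarchy-ID:controller-list:cgroup-path, and a cgroup v2 membership is the single entry with hierarchy ID 0 and an empty controller list. The sketch below classifies the content of that file. Whether OMR uses this file in this way is not confirmed here, and the function name and return shape are hypothetical.

```python
def classify_cgroup_membership(proc_cgroup_text):
    """Classify /proc/<pid>/cgroup content.

    Returns (mode, v2_path) where mode is "v1", "v2", "hybrid" or "none"
    and v2_path is the cgroup v2 path if a v2 entry was seen, else None.
    """
    has_v1 = False
    v2_path = None
    for line in proc_cgroup_text.splitlines():
        fields = line.strip().split(":", 2)
        if len(fields) != 3:
            continue  # not a hierarchy-ID:controller-list:cgroup-path line
        hierarchy_id, controller_list, path = fields
        if hierarchy_id == "0" and controller_list == "":
            v2_path = path  # the unified (v2) hierarchy entry
        else:
            has_v1 = True  # any other entry belongs to a v1 hierarchy
    if has_v1 and v2_path is not None:
        return "hybrid", v2_path
    if v2_path is not None:
        return "v2", v2_path
    if has_v1:
        return "v1", None
    return "none", None


if __name__ == "__main__":
    with open("/proc/self/cgroup") as f:
        print(classify_cgroup_membership(f.read()))
```

Inside a cgroup namespace the v2 path can appear as just "/", in which case this file by itself gives no hint that the process runs in a container.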

Looking for a solution

  1. https://stackoverflow.com/questions/71658810/aks-cluster-create-container-image-using-cri-runtime#:~:text=The%20container%20runtime%20can%20be,docker%20image%20or%20container%20image.
  2. https://github.com/testcontainers/testcontainers-java/issues/3681

jdekonin commented 1 year ago

Whether cgroup v2 is set up correctly in this environment was questioned, so data collection was requested. In a working cgroup v2 environment, using an unmodified websphere-liberty:22.0.0.13-kernel-java11-openj9, cgroup v2 is recognized correctly but running in a container is not.

Options are needed that work even within a working cgroup v2 environment that is running with limited resources.
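Until detection is sorted out, one stopgap is to size the heap explicitly from the cgroup limit instead of relying on the JVM to find it. The sketch below derives an -Xmx value from the content of memory.max. The 50% heap fraction is purely an assumption for illustration, not an OpenJ9 default or a recommendation, and the function name is hypothetical.

```python
def suggest_xmx(memory_max_text, heap_fraction=0.5):
    """Turn the content of memory.max into an -Xmx option, or None if unlimited.

    heap_fraction is an illustrative assumption; pick a value that leaves
    enough room for native memory in the actual workload.
    """
    if not 0 < heap_fraction <= 1:
        raise ValueError("heap_fraction must be in (0, 1]")
    raw = memory_max_text.strip()
    if raw == "max":
        return None  # cgroup v2 reports "max" when no limit is set
    limit_bytes = int(raw)
    heap_mib = int(limit_bytes * heap_fraction) // (1024 * 1024)
    return f"-Xmx{heap_mib}m"


if __name__ == "__main__":
    # Value taken from the memory.max output shown earlier in this issue.
    print(suggest_xmx("536870912"))
```

With the 512M limit from this issue and the assumed 50% fraction, this yields a 256 MiB heap setting.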

Found 2 related issues on OpenJ9:
https://github.com/eclipse-openj9/openj9/issues/137
https://github.com/eclipse-openj9/openj9/issues/4707