Open · 1lann opened 10 months ago
Nice detailed writeup!

> My suggestion is that CRDB first checks `memory.high` before `memory.max`, and should prefer a memory limit specified by `memory.high`. This should make it behave in a more Kubernetes-friendly way on cgroups v2, at least when `MemoryQoS` is enabled.
I agree that there's an opportunity here to update the logic in `pkg/util/cgroups` to prefer `memory.high` before checking `memory.max`. I just wanted to note for future readers that both `memory.high` and `memory.max` can be set to `max`, which will trigger the following logic:
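Roughly, that path behaves like the following minimal sketch (a hypothetical `parseCgroupMemoryLimit` helper, not the actual `pkg/util/cgroups` code):

```go
import (
	"strconv"
	"strings"
)

// parseCgroupMemoryLimit is a hypothetical helper illustrating the point
// above: the literal string "max" in memory.high/memory.max means no limit
// is set at this level, so the caller must fall back to something else
// (e.g. total system memory).
func parseCgroupMemoryLimit(contents string) (int64, bool) {
	v := strings.TrimSpace(contents)
	if v == "max" {
		return 0, false // no limit at this cgroup level
	}
	n, err := strconv.ParseInt(v, 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}
```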
If this happens in a Kubernetes pod, no usable limit is detected and CRDB sizes itself against system memory, so there's a chance the pod can be evicted due to memory usage.
cgroups v2 introduces a new `memory.high` soft limit used for throttling (also referred to as "pressure stall") when processes exceed this limit. See https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files and https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html

Software can then use this memory pressure information to determine whether it should reclaim memory back to the OS. In practice, this is used by Kubernetes for an (alpha) feature, MemoryQoS, which calculates a suitable per-container `memory.high` value based on pod-allocatable memory, requested memory, and memory limits. So I think this is a more suitable number for CRDB's memory use to target, as we know that CRDB can sometimes exceed its detected (hard) memory limits.
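As a rough illustration of that calculation (based on KEP-2570; the 0.9 memory throttling factor is the documented default but is configurable, so treat the exact numbers as assumptions):

```go
// memoryQoSHigh sketches the KEP-2570 formula:
//   memory.high = request + factor * (limit - request)
// (the real implementation also floors the result to the page size).
func memoryQoSHigh(requestBytes, limitBytes int64, throttlingFactor float64) int64 {
	return requestBytes + int64(throttlingFactor*float64(limitBytes-requestBytes))
}
```

e.g. a 4GiB request with an 8GiB limit and the default 0.9 factor gives a `memory.high` of roughly 7.6GiB, so reclaim pressure kicks in well before the 8GiB `memory.max`.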
The reclaim pressure provided by `memory.high` can also prevent k8s from evicting CRDB for high memory use when CRDB's page cache use is high (see https://github.com/kubernetes/kubernetes/issues/43916).

cgroups v2 currently doesn't have a way of determining the effective memory limit of a child cgroup that is being constrained by a parent cgroup. i.e. if cgroup `/kubepods.slice` has a `memory.max` of `6GiB`, and `/kubepods.slice/my-container.slice` has a `memory.max` of `max` (the default k8s uses when no memory limit is set), then `/sys/fs/cgroup/memory.max` from within `my-container` will report `max`, even though the effective limit is actually `6GiB`.
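Where the ancestor cgroups are visible, an effective limit can in principle be computed by walking up the hierarchy and taking the minimum numeric `memory.max`. A sketch (not CRDB code; inside a container the ancestors usually aren't mounted, which is exactly the problem described above):

```go
import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// effectiveMemoryMax walks from cgroupDir up toward the cgroup v2 root,
// returning the smallest numeric memory.max it finds. Levels set to the
// literal "max" impose no limit and are skipped.
func effectiveMemoryMax(cgroupDir string) (int64, bool) {
	const root = "/sys/fs/cgroup"
	var limit int64
	found := false
	for dir := cgroupDir; strings.HasPrefix(dir, root); dir = filepath.Dir(dir) {
		b, err := os.ReadFile(filepath.Join(dir, "memory.max"))
		if err != nil {
			break // e.g. the root cgroup has no memory.max file
		}
		v := strings.TrimSpace(string(b))
		if v == "max" {
			continue
		}
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && (!found || n < limit) {
			limit, found = n, true
		}
	}
	return limit, found
}
```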
This reporting behavior is different from cgroups v1, where `memory.limit_in_bytes` from inside the child cgroup did actually report effective memory limits.

CRDB currently only checks `memory.max` and `memory.limit_in_bytes` for cgroup memory limits.
My suggestion is that CRDB first checks `memory.high` before `memory.max`, and should prefer a memory limit specified by `memory.high`. This should make it behave in a more Kubernetes-friendly way on cgroups v2, at least when `MemoryQoS` is enabled.
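Concretely, the suggested lookup order could look something like this sketch (not CRDB's actual code; `max` in either file just means no limit at that level):

```go
import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// detectCgroupV2MemoryLimit prefers memory.high over memory.max, skipping
// files that are absent or set to the literal "max".
func detectCgroupV2MemoryLimit(cgroupDir string) (int64, bool) {
	for _, name := range []string{"memory.high", "memory.max"} {
		b, err := os.ReadFile(filepath.Join(cgroupDir, name))
		if err != nil {
			continue
		}
		v := strings.TrimSpace(string(b))
		if v == "max" {
			continue
		}
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			return n, true
		}
	}
	return 0, false
}
```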
Related internal Slack thread: https://cockroachlabs.slack.com/archives/C04HQCNHGEP/p1700517717112919
Jira issue: CRDB-33675