Open · 1lann opened 10 months ago
Nice detailed writeup!

> My suggestion is that CRDB first checks `memory.high` before `memory.max`, and should prefer a memory limit specified by `memory.high`. This should make it behave in a more Kubernetes-friendly way on cgroups v2, at least when `MemoryQoS` is enabled.
I agree that there's an opportunity here to update the logic in `pkg/util/cgroups` to prefer `memory.high` before checking `memory.max`. I just wanted to note for future readers that both `memory.high` and `memory.max` can be set to `max`, which will trigger the following logic:
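Roughly, that path behaves like the following minimal sketch (a hypothetical `parseCgroupMemoryLimit` helper, not the actual `pkg/util/cgroups` code):

```go
import (
	"strconv"
	"strings"
)

// parseCgroupMemoryLimit is a hypothetical helper illustrating the point
// above: the literal string "max" in memory.high/memory.max means no limit
// is set at this level, so the caller must fall back to something else
// (e.g. total system memory).
func parseCgroupMemoryLimit(contents string) (int64, bool) {
	v := strings.TrimSpace(contents)
	if v == "max" {
		return 0, false // no limit at this cgroup level
	}
	n, err := strconv.ParseInt(v, 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}
```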
If this happens in a Kubernetes pod, no usable limit is detected and CRDB sizes itself against system memory, so there's a chance the pod can be evicted due to memory usage.
cgroups v2 introduces a new `memory.high` soft limit used for throttling (also referred to as "pressure stall") when processes exceed this limit. See https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files and https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html

Software can then use this memory pressure information to determine whether it should reclaim memory back to the OS. In practice, this is used by Kubernetes for an (alpha) feature, MemoryQoS, which calculates a suitable per-container `memory.high` value based on pod-allocatable memory, requested memory, and memory limits. So I think this is a more suitable number for CRDB's memory use to target, as we know that CRDB can sometimes exceed its detected (hard) memory limits.
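As a rough illustration of that calculation (based on KEP-2570; the 0.9 memory throttling factor is the documented default but is configurable, so treat the exact numbers as assumptions):

```go
// memoryQoSHigh sketches the KEP-2570 formula:
//   memory.high = request + factor * (limit - request)
// (the real implementation also floors the result to the page size).
func memoryQoSHigh(requestBytes, limitBytes int64, throttlingFactor float64) int64 {
	return requestBytes + int64(throttlingFactor*float64(limitBytes-requestBytes))
}
```

e.g. a 4GiB request with an 8GiB limit and the default 0.9 factor gives a `memory.high` of roughly 7.6GiB, so reclaim pressure kicks in well before the 8GiB `memory.max`.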
The reclaim pressure provided by `memory.high` can also prevent k8s from evicting CRDB for high memory use when CRDB's page cache use is high (see https://github.com/kubernetes/kubernetes/issues/43916).

cgroups v2 currently doesn't have a way of determining the effective memory limit of a child cgroup that is being constrained by a parent cgroup. i.e. if cgroup `/kubepods.slice` has a `memory.max` of `6GiB`, and `/kubepods.slice/my-container.slice` has a `memory.max` of `max` (the default k8s uses when no memory limit is set), then `/sys/fs/cgroup/memory.max` from within `my-container` will report `max`, even though the effective limit is actually `6GiB`.
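Where the ancestor cgroups are visible, an effective limit can in principle be computed by walking up the hierarchy and taking the minimum numeric `memory.max`. A sketch (not CRDB code; inside a container the ancestors usually aren't mounted, which is exactly the problem described above):

```go
import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// effectiveMemoryMax walks from cgroupDir up toward the cgroup v2 root,
// returning the smallest numeric memory.max it finds. Levels set to the
// literal "max" impose no limit and are skipped.
func effectiveMemoryMax(cgroupDir string) (int64, bool) {
	const root = "/sys/fs/cgroup"
	var limit int64
	found := false
	for dir := cgroupDir; strings.HasPrefix(dir, root); dir = filepath.Dir(dir) {
		b, err := os.ReadFile(filepath.Join(dir, "memory.max"))
		if err != nil {
			break // e.g. the root cgroup has no memory.max file
		}
		v := strings.TrimSpace(string(b))
		if v == "max" {
			continue
		}
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && (!found || n < limit) {
			limit, found = n, true
		}
	}
	return limit, found
}
```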
This reporting behavior is different from cgroups v1, where `memory.limit_in_bytes` from inside the child cgroup did actually report effective memory limits.

CRDB currently only checks `memory.max` and `memory.limit_in_bytes` for cgroup memory limits.
My suggestion is that CRDB first checks `memory.high` before `memory.max`, and should prefer a memory limit specified by `memory.high`. This should make it behave in a more Kubernetes-friendly way on cgroups v2, at least when `MemoryQoS` is enabled.
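Concretely, the suggested lookup order could look something like this sketch (not CRDB's actual code; `max` in either file just means no limit at that level):

```go
import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// detectCgroupV2MemoryLimit prefers memory.high over memory.max, skipping
// files that are absent or set to the literal "max".
func detectCgroupV2MemoryLimit(cgroupDir string) (int64, bool) {
	for _, name := range []string{"memory.high", "memory.max"} {
		b, err := os.ReadFile(filepath.Join(cgroupDir, name))
		if err != nil {
			continue
		}
		v := strings.TrimSpace(string(b))
		if v == "max" {
			continue
		}
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			return n, true
		}
	}
	return 0, false
}
```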
Related internal Slack thread: https://cockroachlabs.slack.com/archives/C04HQCNHGEP/p1700517717112919
Jira issue: CRDB-33675