cmacknz opened 1 year ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
I think we will need to experiment with a few different scenarios to test this properly.
Just so we don't forget: if the `ExitCode` is -1, that signals that the "process hasn't exited or was terminated by a signal". We currently just log the `ExitCode` when a sub-process exits. We could add to the error message, when the exit code is -1, that this is potentially an OOM kill, or at least that the process is being killed by an external mechanism.
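As a rough illustration of what that hint could look like (a minimal sketch; `runAndReport` and the exact wording are hypothetical, not the agent's actual process-management code):

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runAndReport is a hypothetical wrapper: it runs a sub-process and, when the
// reported exit code is -1 (the process was terminated by a signal rather than
// exiting on its own), adds a hint that this may be the OOM killer or another
// external kill mechanism.
func runAndReport(cmd *exec.Cmd) error {
	err := cmd.Run()
	if err == nil {
		return nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == -1 {
		return fmt.Errorf("sub-process %s exited with code -1 (terminated by a signal, possibly OOM killed): %w",
			cmd.Path, err)
	}
	return fmt.Errorf("sub-process %s failed: %w", cmd.Path, err)
}

func main() {
	// Example: a process killed by SIGKILL reports ExitCode() == -1.
	if err := runAndReport(exec.Command("sh", "-c", "kill -9 $$")); err != nil {
		fmt.Println(err)
	}
}
```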
The reporting we get from k8s when a pod is OOMKilled differs based on the Kubernetes version.
Starting from Kubernetes 1.28, the `memory.oom.group` feature of cgroups v2 is turned on by default, so the pod will be OOM killed if any process in the container cgroup hits a memory limit. Prior versions have `memory.oom.group` turned off, so the pods won't be annotated with the `OOMKilled` last exit reason. Most of our memory consumption happens in sub-processes, so we hit this situation frequently.
Kubernetes change log for reference: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.28.md
> If using cgroups v2, then the cgroup aware OOM killer will be enabled for container cgroups via `memory.oom.group`. This causes processes within the cgroup to be treated as a unit and killed simultaneously in the event of an OOM kill on any process in the cgroup. (#117793, @tzneal) [SIG Apps, Node and Testing]
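If we want to confirm at runtime which behaviour a given node gives us, a minimal sketch like the one below could read the flag for the agent's own cgroup (assuming a pure cgroups v2 node with the unified hierarchy mounted at `/sys/fs/cgroup`; the path handling is illustrative only):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Minimal sketch: report whether memory.oom.group is set for the cgroup the
// current process runs in. Hybrid cgroups v1/v2 setups are not handled.
func main() {
	// On cgroups v2, /proc/self/cgroup contains a single line like "0::/kubepods/...".
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("cannot read /proc/self/cgroup:", err)
		return
	}
	parts := strings.SplitN(strings.TrimSpace(string(data)), "::", 2)
	if len(parts) != 2 {
		fmt.Println("does not look like a cgroups v2 unified hierarchy")
		return
	}

	flagPath := filepath.Join("/sys/fs/cgroup", parts[1], "memory.oom.group")
	val, err := os.ReadFile(flagPath)
	if err != nil {
		fmt.Println("cannot read", flagPath, ":", err)
		return
	}
	// "1" means the whole cgroup is killed as a unit on OOM,
	// "0" means only the offending process is killed.
	fmt.Printf("memory.oom.group = %s\n", strings.TrimSpace(string(val)))
}
```

On 1.28+ nodes with cgroups v2 this should read `1` for container cgroups; older versions leave it at the default `0`.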
We need to make it easier to detect inadequate memory limits on Kubernetes, which are extremely common.
The agent should detect when its last status was OOM killed and report its status as degraded. Detecting that an agent has been OOMKilled from diagnostics alone is not easy; it must be inferred from process restarts appearing in the agent diagnostics with no other plausible explanation.
Today the primary way for us to detect this is to instruct users to run `kubectl describe pod` and look for the `OOMKilled` last state and reason in the output. We should automate this process and have the agent read the last state and reason for itself and report it in the agent status report.
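A minimal sketch of that automation using client-go (assumptions: the agent runs in-cluster, has RBAC to get its own Pod, and `POD_NAME`/`POD_NAMESPACE` are injected via the downward API):

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Read the agent's own Pod and inspect the last termination state of each
	// container, which is where Kubernetes records the OOMKilled reason.
	pod, err := client.CoreV1().Pods(os.Getenv("POD_NAMESPACE")).
		Get(context.Background(), os.Getenv("POD_NAME"), metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	for _, cs := range pod.Status.ContainerStatuses {
		term := cs.LastTerminationState.Terminated
		if term != nil && term.Reason == "OOMKilled" {
			fmt.Printf("container %q was previously OOM killed (exit code %d)\n",
				cs.Name, term.ExitCode)
		}
	}
}
```

Surfacing this through the existing status reporting would let the agent mark itself degraded without users having to run `kubectl describe pod` themselves.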
We have also seen cases where the agent sub-processes are killed and restarted without the agent process itself being OOMKilled (because the sub-processes use more memory). We should double check that the OOMKilled reason appears on the pod when this happens.
The OOM kill event also appears in the node kernel logs if we end up needing to look there.