elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

The agent should more clearly indicate when it or its sub-processes have been OOM killed on Kubernetes #3641

Open cmacknz opened 1 year ago

cmacknz commented 1 year ago

We need to make it easier to detect inadequate memory limits on Kubernetes, which are extremely common.

The agent should detect when its last state was OOMKilled and report its status as degraded. Detecting that an agent has been OOMKilled from diagnostics alone is not easy; it must be inferred from process restarts appearing in the agent diagnostics with no other plausible explanation.

Today the primary way for us to detect this is to instruct users to run kubectl describe pod and look for the following:

       Last State:   Terminated
       Reason:       OOMKilled
       Exit Code:    137

We should automate this process and have the agent read the last state and reason for itself and report it in the agent status report.
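
A minimal sketch of what that automation could look like, assuming the agent knows its own pod name and namespace (for example via Downward API environment variables such as POD_NAME and POD_NAMESPACE, which are assumptions here) and can use client-go to read its container statuses; this is not the agent's existing code:

```go
// Sketch only: read this pod's container statuses and surface an OOMKilled
// last state. POD_NAME and POD_NAMESPACE are assumed to be injected via the
// Downward API.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pod, err := client.CoreV1().Pods(os.Getenv("POD_NAMESPACE")).Get(
		context.Background(), os.Getenv("POD_NAME"), metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastTerminationState.Terminated; t != nil {
			// Reason is "OOMKilled" and ExitCode is 137 when the kernel OOM
			// killer terminated the container's main process.
			fmt.Printf("container %s last terminated: reason=%s exitCode=%d\n",
				cs.Name, t.Reason, t.ExitCode)
		}
	}
}
```

Any container reported there with reason OOMKilled and exit code 137 could then be surfaced as a degraded status instead of asking users to run kubectl describe pod themselves.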

We have also seen cases where the agent sub-processes are killed and restarted without the agent process itself being OOMKilled (because the sub-processes use more memory). We should double check that the OOMKilled reason appears on the pod when this happens.

The OOM kill event also appears in the node kernel logs if we end up needing to look there:

Mar 13 20:37:14 aks-default-32489819 kernel: [2442796.469054] Memory cgroup out of memory: Killed process 2532535 (filebeat) total-vm:2766604kB, anon-rss:1298484kB, file-rss:71456kB, shmem-rss:0kB, UID:0 pgtables:2992kB oom_score_adj:-997
Mar 13 20:37:14 aks-default-32489819 systemd[1]: cri-containerd-8a7c9177c7f2c619df882ecfebb3895c.scope: A process of this unit has been killed by the OOM killer.
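If we do go down that route, a rough sketch of the fallback (assuming host-level access such as a privileged debug container, with the process names hard-coded purely for illustration) could shell out to dmesg and filter for OOM kill records:

```go
// Sketch only: scan the node's kernel ring buffer for OOM kill records that
// mention agent sub-processes. This needs host-level access, so it is a
// troubleshooting aid rather than default agent behaviour; the process names
// below are just examples.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("dmesg").Output()
	if err != nil {
		fmt.Println("could not read kernel ring buffer:", err)
		return
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		line := scanner.Text()
		// e.g. "Memory cgroup out of memory: Killed process 2532535 (filebeat) ..."
		if strings.Contains(line, "out of memory: Killed process") &&
			(strings.Contains(line, "(filebeat)") ||
				strings.Contains(line, "(metricbeat)") ||
				strings.Contains(line, "(elastic-agent)")) {
			fmt.Println("possible OOM kill:", line)
		}
	}
}
```
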
elasticmachine commented 1 year ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 7 months ago

I think we will need to experiment with a few different scenarios to test this properly:

leehinman commented 7 months ago

Just so we don't forget: if the ExitCode is -1, that signals that the "process hasn't exited or was terminated by a signal". We currently just log the ExitCode when a sub-process exits. If the exit code is -1, we could extend the error message to say that this is potentially an OOM kill, or at least that the process is being killed by an external mechanism.
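
A minimal sketch of that idea, separate from the agent's real process supervision code: os/exec reports an exit code of -1 in exactly this case, and on Unix the wait status shows whether the process died from a signal such as SIGKILL, which is what the OOM killer sends. The sub-process path below is hypothetical.

```go
// Sketch only, separate from the agent's real supervision: os/exec reports
// ExitCode() == -1 when the process was terminated by a signal, and the Unix
// wait status tells us which signal it was.
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

func main() {
	// Hypothetical sub-process path, purely for illustration.
	cmd := exec.Command("/opt/agent/components/filebeat")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	waitErr := cmd.Wait()

	if cmd.ProcessState.ExitCode() == -1 {
		msg := "process was terminated by an external mechanism"
		if ws, ok := cmd.ProcessState.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
			if ws.Signal() == syscall.SIGKILL {
				// SIGKILL is what the kernel OOM killer sends, so call that out.
				msg = "process received SIGKILL, possibly OOM killed"
			}
		}
		fmt.Printf("sub-process exited: %s (wait error: %v)\n", msg, waitErr)
	}
}
```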

cmacknz commented 6 months ago

The reporting we get from k8s when a pod is OOMKilled differs based on the Kubernetes version.

Starting with Kubernetes 1.28 the memory.oom.group feature of cgroups v2 is enabled by default, so when any process in the container cgroup triggers an OOM kill, every process in the cgroup (the agent and its sub-processes) is killed together and the pod is marked OOMKilled.

Earlier versions have memory.oom.group turned off, so when only a sub-process is OOM killed the pod won't be annotated with the OOMKilled last state reason. Most of our memory consumption happens in the sub-processes, so we hit this situation frequently.

Kubernetes change log for reference: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.28.md

If using cgroups v2, then the cgroup aware OOM killer will be enabled for container cgroups via memory.oom.group . This causes processes within the cgroup to be treated as a unit and killed simultaneously in the event of an OOM kill on any process in the cgroup. (#117793, @tzneal) [SIG Apps, Node and Testing]
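
One way the agent could tell which behaviour applies at runtime is to read memory.oom.group from its own cgroup. A rough sketch, assuming a cgroups v2 node where the container's cgroup is mounted at /sys/fs/cgroup (both assumptions, not existing agent code):

```go
// Sketch only: check whether group OOM kills apply to this container by
// reading memory.oom.group. Assumes a cgroups v2 node where the container's
// own cgroup is mounted at /sys/fs/cgroup; on cgroups v1 the file does not
// exist.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/memory.oom.group")
	if err != nil {
		fmt.Println("memory.oom.group not readable (likely cgroups v1):", err)
		return
	}
	if strings.TrimSpace(string(data)) == "1" {
		fmt.Println("group OOM kill enabled: the whole container is killed together")
	} else {
		fmt.Println("group OOM kill disabled: sub-processes can be OOM killed individually")
	}
}
```

Knowing which case applies would let the agent decide whether to trust the pod's last state or fall back to inspecting its sub-process exit codes.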