michalpristas opened 1 year ago
Same for GKE
cc @valerioarvizzigno @hemantmalik @eric-lowry
The issue description here is very vague. I don't know why the diagnostics functionality would be at all tied to the Kubernetes environment it is deployed into.
What I think is much more likely is that collecting diagnostics triggers a spike in the memory usage of the agent that could cause it to be OOMKilled.
On an affected agent, can we run `kubectl describe pod` and look for the following?

```
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
```
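The same check can also be scripted against the pod's JSON status (as returned by `kubectl get pod <name> -o json`). A minimal sketch, assuming the JSON has already been loaded into a dict; the function name is illustrative, not part of any tool:

```python
# Sketch: detect containers whose last termination was an OOM kill,
# given a pod object shaped like `kubectl get pod <name> -o json` output.

def oomkilled_containers(pod: dict) -> list[str]:
    """Return names of containers last terminated with reason OOMKilled (exit code 137)."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled" and terminated.get("exitCode") == 137:
            hits.append(cs["name"])
    return hits

# Example shaped like the pod described later in this issue.
pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "elastic-agent",
                "restartCount": 2,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}
print(oomkilled_containers(pod))  # -> ['elastic-agent']
```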
Hello,

Indeed, it seems this OOM event happened when I requested the diagnostics:
```
Name:             elastic-agent-hbdrn
Namespace:        kube-system
Priority:         0
Service Account:  elastic-agent
Node:             aks-userpool-61251298-vmss000000/10.224.0.4
Start Time:       Wed, 17 May 2023 11:26:48 +0000
Labels:           app=elastic-agent
                  controller-revision-hash=5bb484d6d8
                  pod-template-generation=1
Annotations:      <none>
Status:           Running
IP:               10.224.0.4
IPs:
  IP:  10.224.0.4
Controlled By:  DaemonSet/elastic-agent
Containers:
  elastic-agent:
    Container ID:   containerd://7c680c9d40efc1079a045f93184e821bc78a730e518df09546b3aff36ebcc67b
    Image:          docker.elastic.co/beats/elastic-agent:8.7.1
    Image ID:       docker.elastic.co/beats/elastic-agent@sha256:c916a16360ef0d8851a7012d8de3981c0e82f60ae1bc537a1798e5727b13d8e3
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 17 May 2023 18:42:28 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 17 May 2023 11:31:48 +0000
      Finished:     Wed, 17 May 2023 18:42:27 +0000
    Ready:          True
    Restart Count:  2
    Limits:
      memory:  700Mi
    Requests:
      cpu:     300m
      memory:  400Mi
    Environment:
      FLEET_INSECURE:                true
      FLEET_URL:                     https://b9cc062431d64ac0a02edf3ea543dd0a.fleet.europe-west1.gcp.cloud.es.io:443
      KIBANA_FLEET_PASSWORD:         changeme
      KUBERNETES_PORT_443_TCP_ADDR:  sa-da-aks-01-dns-5p6pkhjo.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:               tcp://sa-da-aks-01-dns-5p6pkhjo.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://sa-da-aks-01-dns-5p6pkhjo.hcp.westeurope.azmk8s.io:443
      FLEET_ENROLL:                  1
      FLEET_ENROLLMENT_TOKEN:        SHo5dUpvZ0Izdm12ZFpUUjYxM0M6R0EwN1N0WHBSei0wZTJabkJmNERyQQ==
      KIBANA_HOST:                   http://kibana:5601
      KIBANA_FLEET_USERNAME:         elastic
      NODE_NAME:                     (v1:spec.nodeName)
      POD_NAME:                      elastic-agent-hbdrn (v1:metadata.name)
      KUBERNETES_SERVICE_HOST:       sa-da-aks-01-dns-5p6pkhjo.hcp.westeurope.azmk8s.io
    Mounts:
      /etc/machine-id from etc-mid (ro)
      /hostfs/etc from etc-full (ro)
      /hostfs/proc from proc (ro)
      /hostfs/sys/fs/cgroup from cgroup (ro)
      /hostfs/var/lib from var-lib (ro)
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/log from varlog (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6nlxh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:
  cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:
  varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  etc-full:
    Type:          HostPath (bare host directory volume)
    Path:          /etc
    HostPathType:
  var-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib
    HostPathType:
  etc-mid:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/machine-id
    HostPathType:  File
  kube-api-access-6nlxh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason   Age                    From     Message
  ----    ------   ---                    ----     -------
  Normal  Pulled   5m33s (x3 over 7h21m)  kubelet  Container image "docker.elastic.co/beats/elastic-agent:8.7.1" already present on machine
  Normal  Created  5m33s (x3 over 7h21m)  kubelet  Created container elastic-agent
  Normal  Started  5m33s (x3 over 7h21m)  kubelet  Started container elastic-agent
```
Indeed, I had to raise the memory limit in the manifest to 1000Mi so that the agent diagnostics could be extracted remotely from Kibana. cc @pierrehilbert
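For reference, the workaround corresponds to this fragment of the elastic-agent DaemonSet manifest (requests taken from the pod description above; only the memory limit changes):

```yaml
# elastic-agent container spec (fragment)
resources:
  requests:
    cpu: 300m
    memory: 400Mi
  limits:
    memory: 1000Mi   # raised from 700Mi so the diagnostics request survives the memory spike
```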
Thanks for confirming. Agreed that having to increase the memory limit from 700Mi to 1000Mi just to allow requesting diagnostics is not ideal. There might be something we can do to improve this; perhaps we are holding too much of the diagnostics content in memory before writing it to disk.
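If buffering is indeed the cause, the general fix would be to stream each diagnostics entry straight into the on-disk archive instead of accumulating the whole archive in memory first. A hypothetical sketch of that pattern (names are illustrative; this is not the agent's actual code, which is written in Go):

```python
import zipfile

CHUNK = 64 * 1024  # copy in 64 KiB chunks so only one chunk is resident at a time


def stream_diagnostics(sources: dict[str, str], archive_path: str) -> None:
    """Write each source file into the zip incrementally, so peak memory
    stays near one chunk rather than the size of the whole archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for arcname, src_path in sources.items():
            # zf.open(..., mode="w") compresses and flushes to disk as we write.
            with open(src_path, "rb") as src, zf.open(arcname, mode="w") as dst:
                while chunk := src.read(CHUNK):
                    dst.write(chunk)


# Demo with a hypothetical diagnostics file.
import pathlib
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())
log = tmp / "elastic-agent.log"
log.write_text("sample log line\n" * 1000)
out = tmp / "diagnostics.zip"
stream_diagnostics({"logs/elastic-agent.log": str(log)}, str(out))
print(zipfile.ZipFile(out).namelist())  # -> ['logs/elastic-agent.log']
```

The same idea applies in Go via `archive/zip`, whose writer also streams entries to the underlying file.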
On my overloaded GKE cluster, increasing the limit to that level would be nearly impossible; it would require spreading the applications across 2 or 3 additional nodes to leave room for elastic-agent's memory consumption.
When trying to request diagnostics from an agent running in a Kubernetes deployment on AKS, the request is not successful. It appears that requesting diagnostics can cause the agent's memory usage to burst above its previous steady-state level.