elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Kubernetes metadata overwhelms memory limits in the Agent process #4729

Open faec opened 4 months ago

faec commented 4 months ago

Diagnostics from production Agents running on Kubernetes show:

We need to understand why the Kubernetes helpers are using so much memory, and find a way to mitigate it.

Definition of done

elasticmachine commented 4 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz commented 4 months ago

Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests: https://github.com/elastic/elastic-agent/issues/4730

faec commented 4 months ago

> Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests

FWIW the diagnostics described by this issue were from 8.13.3.

elasticmachine commented 4 months ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

jlind23 commented 4 months ago

After chatting with @cmacknz and @pierrehilbert, assigning this to you @faec and making it a high priority for the next sprint.

bturquet commented 4 months ago

cc @gizas

faec commented 4 months ago

Agent's variable provider API is very opaque, which is probably a big part of this. Agent's Coordinator doesn't place any constraints on what variables might be requested, so the Kubernetes helpers make (and cache) very large / verbose state queries. https://github.com/elastic/elastic-agent/issues/2887 is related -- a possible Agent-side solution is to implement better policy parsing that validates the full configuration and gives variable providers like Kubernetes a list of the variables that are actually used.

@bturquet / @gizas, if we add hooks to the variable provider API for the Coordinator to give a list of possible variables, what work would be needed to restrict Kubernetes queries to those variables?
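A minimal sketch of what such a hook could look like, with hypothetical names (this is not the current elastic-agent composable API, just an illustration of the idea):

```go
// Hypothetical sketch only -- the names below are illustrative, not the
// current elastic-agent composable API.
package composable

// VariableAwareProvider could be implemented by context providers (such as
// the Kubernetes provider) that are able to narrow what they fetch and
// cache once the Coordinator knows which variables the policy references.
type VariableAwareProvider interface {
	// SetReferencedVariables would be called by the Coordinator after
	// parsing the policy, with keys such as "kubernetes.pod.name" or
	// "kubernetes.labels.app". The provider could use the list to watch
	// and cache only the matching fields.
	SetReferencedVariables(keys []string)
}
```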

gizas commented 4 months ago

@faec I'm trying to understand how we can combine those pieces. So let's say the parsing changes and there is a list of variables that the provider will need to populate. In the Kubernetes provider we start the watchers here, but with general arguments.

The other metadata enrichment we do with enrichers is again unrelated to the flow you describe here.

Maybe we can sync offline so I can understand more about this?

cc @MichaelKatsoulis

alexsapran commented 3 months ago

Hi all,

I was looking at this, and I wanted to know whether we apply any filtering to the Kubernetes metadata we receive. Does all of it need to be cached in the local k8s cache? I'd like to know if we can apply a transformation that nullifies some of the fields and keeps only the ones we care about; this way, the RSS memory of the Elastic Agent would hold only the data we care about and would not be influenced by the size of the k8s cluster.
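For reference, client-go informers accept a cache.TransformFunc that is applied to every object before it is stored in the local cache. A minimal sketch of the kind of field-nullifying transform being suggested here (the specific fields dropped are an assumption, not a statement of what the agent actually needs):

```go
package k8stransform

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// StripPodFields nullifies parts of a Pod object that are never read, so
// only the fields we care about stay resident in the informer cache. The
// fields dropped here are illustrative; a real implementation would keep
// whatever the metadata enrichment actually uses.
var StripPodFields cache.TransformFunc = func(obj interface{}) (interface{}, error) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return obj, nil
	}
	pod.ManagedFields = nil     // server-side apply bookkeeping, often large
	pod.Spec.Volumes = nil      // volume details are not needed for enrichment
	pod.Status.Conditions = nil // condition history is not needed either
	return pod, nil
}
```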

neiljbrookes commented 2 months ago

Hello all @faec @ycombinator, is there any update on this issue? I am planning an upgrade to 8.14.1 this week; do we anticipate any improvements?

pierrehilbert commented 2 months ago

Fae is currently on PTO and unfortunately hasn't had time to investigate this yet. It is planned for the current sprint (which started today).

rgarcia89 commented 2 months ago

We are facing this issue too. We see the elastic agents hitting the current memory limit of 1200Mi. I would greatly appreciate it if this topic could be given higher priority, as it is quite annoying to see the agents using that much memory.

pierrehilbert commented 2 months ago

Hello @rgarcia89, this topic has a high priority but, as you can imagine, it is not the only one. @faec will start looking at this soon, so I hope we will be able to share good news shortly.

nimdanitro commented 1 month ago

FWIW, I think we could apply some meaningful transformers in the informers. We did something very similar in our mki-cost-exporter project: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/costmeter/meter.go#L124C14-L124C26. Here is an example of the cache.TransformFunc which we set on our informers: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/informers/transformer/transform.go#L34

Obviously, we could ignore a large portion of the information for our specific use case.
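To connect this with the earlier filtering question: a self-contained sketch of wiring such a transform into a client-go shared informer via SetTransform (the transform body below is just an example; which fields the agent could safely drop is an open question):

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	// The transform runs on every object before it enters the local cache,
	// so whatever is nilled out here never occupies agent memory.
	if err := podInformer.SetTransform(func(obj interface{}) (interface{}, error) {
		if pod, ok := obj.(*corev1.Pod); ok {
			pod.ManagedFields = nil // example: drop server-side apply bookkeeping
		}
		return obj, nil
	}); err != nil {
		panic(err)
	}

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```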

yuvielastic commented 1 month ago

Hey team, any update on this issue? Given that it has been acknowledged as a high priority but there have been no updates for months, this is very worrisome.

Can we please prioritize this? We need to get the agent footprint down as much as possible, since provisioning 4 GB of memory reduces the overall RAM available for customer workloads.

amitkanfer commented 1 month ago

It's still prioritized. Unfortunately there were other more urgent matters that we're still wrapping up.

zez3 commented 1 month ago

@faec any updates from your part?

blakerouse commented 1 month ago

@faec There is one issue that I filed a while ago that I think would help reduce memory usage in the case where a specific provider is not even being used: https://github.com/elastic/elastic-agent/issues/3609. With that change, unless the policy references a provider, there is no reason for it to even be running.

Using the same logic, it could build on your idea of recording exactly which variables are referenced by the policy. The variable storage system used by the composable module could then use this information to store only what is needed, without having to change the providers at all (it could just drop the fields that are not needed).

The issue is the case where a policy starts referencing a new variable whose information has already been dropped, even though the provider had originally supplied it. This is why I believe the providers will ultimately need to be given the list of variables referenced in the policy. That would allow them to do only the minimal work required, and to notice when a new variable is added so they can push an update to the variable storage and make that variable's information present again.
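A rough sketch of the storage-side filtering idea, i.e. keeping only the referenced paths of a provider's mapping before it is stored (names and types are illustrative, not the actual composable store):

```go
package vars

import "strings"

// FilterMapping keeps only the keys of a provider's mapping that are
// referenced by the policy (dotted paths such as "kubernetes.pod.name"),
// dropping everything else before the mapping is stored.
func FilterMapping(mapping map[string]interface{}, referenced []string) map[string]interface{} {
	out := map[string]interface{}{}
	for _, ref := range referenced {
		copyPath(mapping, out, strings.Split(ref, "."))
	}
	return out
}

// copyPath copies the value at the given path from src into dst,
// creating intermediate maps as needed and ignoring missing keys.
func copyPath(src, dst map[string]interface{}, path []string) {
	if len(path) == 0 {
		return
	}
	val, ok := src[path[0]]
	if !ok {
		return
	}
	if len(path) == 1 {
		dst[path[0]] = val
		return
	}
	srcChild, ok := val.(map[string]interface{})
	if !ok {
		return
	}
	dstChild, ok := dst[path[0]].(map[string]interface{})
	if !ok {
		dstChild = map[string]interface{}{}
		dst[path[0]] = dstChild
	}
	copyPath(srcChild, dstChild, path[1:])
}
```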

EvelienSchellekens commented 2 weeks ago

I'm running into memory issues with Elastic Agent 8.15. It's running on Kubernetes, and we limit the memory to 700Mi in the manifest file in Kibana. However, when enabling the system metrics + Kubernetes integrations, the process keeps crashing and I get almost no data in. When I raise the limit to 800Mi, it runs stably. This seems related to this GH issue.

Here are my test results:

Elastic Agent 8.15.0 (only system metrics integration), limit 700Mi:

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-hkfsw   21m          442Mi 

Elastic Agent 8.15.0 (system metrics integration + K8s integration), limit 700Mi: -> keeps crashing, no data

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-hkfsw   236m         699Mi

Elastic Agent 8.15.0 (system metrics integration + K8s integration), limit 800Mi: -> runs stable

NAME                  CPU(cores)   MEMORY(bytes)   
elastic-agent-dbzzm   52m          703Mi 

This setup is being used for (marketing) workshops and it's not a great look to ask attendees to increase the memory limit when the Elastic Agent only uses 2 integrations.

gizas commented 1 week ago

We ran some scaling tests in the past that propose a resource configuration (based on 8.7) as a reference point for comparison.

At the moment the @elastic/obs-ds-hosted-services focus is the OTel-native Kubernetes collection of logs/metrics, and we have no plans to run scaling tests for elastic-agent + integrations in the current iteration (cc @mlunadia).

We can wait and see the OTel elastic-agent memory consumption with the latest config, and also check the current resourcing of elastic-agent with the system + k8s integrations.

LucaWintergerst commented 5 days ago

This issue occurs even with very small workloads, so it's not really about scale testing.

It is reproducible on a single-node k8s cluster with 26 total pods running.

swiatekm commented 4 days ago

Posting the results of my initial investigation. For now, I'm inclined to agree with Michael's conclusion in https://github.com/elastic/sdh-beats/issues/5148#issuecomment-2352771442 that there isn't a regression here. Still, the increase in memory usage from adding more Pods to the Node seems excessive, and it's not clear where it's coming from.

Test setup

Findings

MichaelKatsoulis commented 4 days ago

I would also like to post some results here, based on Luca's comment about the OOM with small workloads. I ran some tests on multiple versions of elastic-agent and want to share the results.

I used a single-node cluster in GKE with 38 pods running. Here are the results of Elastic Agent's memory consumption per version:

Version 8.15.1

| Integration | Memory consumption |
| --- | --- |
| no integration | 280-330 MB |
| system | 450-500 MB |
| Kubernetes | 550-600 MB |
| Kubernetes & system | 740-790 MB (restarts) |

Version 8.14.0

| Integration | Memory consumption |
| --- | --- |
| no integration | 260-290 MB |
| system | 410-430 MB |
| Kubernetes | 550-570 MB |
| Kubernetes & system | 700-730 MB |

Version 8.13.0

| Integration | Memory consumption |
| --- | --- |
| no integration | 200-210 MB |
| system | 320-330 MB |
| Kubernetes | 500-510 MB |
| Kubernetes & system | 630-650 MB |

Version 8.12.0

| Integration | Memory consumption |
| --- | --- |
| no integration | 180-185 MB |
| system | 300-330 MB |
| Kubernetes | 480-520 MB |
| Kubernetes & system | 630-680 MB |

Version 8.11.0

| Integration | Memory consumption |
| --- | --- |
| no integration | 169-190 MB |
| system | 300-310 MB |
| Kubernetes | 520-550 MB |
| Kubernetes & system | 660-720 MB (restart) |

The easy thing to notice here is that the increase in memory that the Kubernetes integration adds to Elastic Agent is almost constant across versions, around 300-350 MB. It actually got better after some improvements to metadata enrichment handling from 8.14.0 onwards. Memory consumption of Elastic Agent with no integration at all increased over the version bumps, and with Kubernetes and System (which comes by default) installed it reached the set limit of 700 MB. I don't know whether the ~300 MB that the Kubernetes integration adds is a lot or not, but considering that the System integration, which does far less (no constant API calls to k8s), adds around 150 MB, I could argue it is reasonable.

Another thing to note is that even without the Kubernetes integration installed, the Kubernetes provider and the add_kubernetes_metadata processor are still enabled by default. I took a look at the heap.pprof of such an agent, and Kubernetes-related functions seem to account for around 10% of memory use.

I would like to understand @faec's comment more:

> Within the elastic-agent process, more than 90% of memory use is in Kubernetes helpers

How was this measured? With or without Kubernetes Integration? Which version?

swiatekm commented 4 days ago

@MichaelKatsoulis is this with agent monitoring enabled? I got the container memory usage to ~50Mi after disabling that, with only the elastic-agent binary running in the container. But this still increased to ~90 Mi after starting more Pods.

MichaelKatsoulis commented 4 days ago

> @MichaelKatsoulis is this with agent monitoring enabled? I got the container memory usage to ~50Mi after disabling that, with only the elastic-agent binary running in the container. But this still increased to ~90 Mi after starting more Pods.

Yes it is enabled. I kept all the defaults. If disabled, memory consumption with just the binary running is around what you mentioned.

cmacknz commented 4 days ago

> Memory consumption of Elastic Agent with no integration at all increased over the version bumps

The jump in 8.14.0 is because of agentbeat, see https://github.com/elastic/elastic-agent/issues/4730

henrikno commented 2 days ago

The elastic-agent pod is using 4 GB of RAM. Pods on that host: https://gist.github.com/henrikno/27c4165cd7eec7b3a24c424d8a8dad23; ps aux: https://gist.github.com/henrikno/92634f31dd8a3795ff1ec81b34dc1bf8. elastic-agent is using 2.2 GB, and the largest metricbeat (kubernetes-metrics) 1.6 GB.

It sounds a bit similar to https://github.com/topfreegames/maestro/pull/473, where updates from k8s come in faster than they can be processed, so they get buffered somewhere in memory.

swiatekm commented 1 day ago

Looking at the profile supplied by @henrikno, this anomalous memory consumption is caused by storing ReplicaSet data. @neiljbrookes confirmed on Slack that the K8s clusters in question have a lot of Deployments, and consequently ReplicaSets. For example, we have ~7000 Deployments and ~75000 ReplicaSets in a particularly troublesome cluster. The heap profile shows ~700 MB of steady-state memory usage, which comes out to around 10 KB per ReplicaSet, which is a reasonable value.

[heap profile image]

The OOMs were mitigated by setting GOGC to 25, which suggests that churn from excessive updates from the API Server is part of the problem as well.

I'm planning to shortly submit a fix that will cause us to store only the necessary data. Stopping the churn is going to be more challenging, but I think we should be able to solve it by subscribing only to metadata changes for these ReplicaSets. This will be harder to integrate into our autodiscovery framework, but it is also less urgent.

Worth noting that I don't believe this is the problem causing unexpected agent memory consumption on Nodes with a lot of Pods, even in small clusters.
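As a sketch of the metadata-only subscription idea, client-go ships a metadata informer that caches only PartialObjectMetadata instead of full objects; the ReplicaSet resource and the event handler below are illustrative, not the actual agent wiring:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A metadata-only informer receives PartialObjectMetadata objects, so
	// only names, labels, annotations, and owner references are cached --
	// not the full ReplicaSet spec and status.
	factory := metadatainformer.NewSharedInformerFactory(client, 10*time.Minute)
	rsGVR := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}
	rsInformer := factory.ForResource(rsGVR).Informer()

	rsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if rs, ok := obj.(*metav1.PartialObjectMetadata); ok {
				_ = rs.OwnerReferences // enough to resolve the owning Deployment
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```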