Open faec opened 4 months ago
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests, see https://github.com/elastic/elastic-agent/issues/4730
FWIW the diagnostics described by this issue were from 8.13.3.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
After chatting with @cmacknz and @pierrehilbert, assigning this to you @faec and making it a high priority for the next sprint.
cc @gizas
Agent's variable provider API is very opaque, which is probably a big part of this. Agent's Coordinator doesn't provide any constraints on what variables might be requested, hence the Kubernetes helpers make (and cache) very large / verbose state queries. https://github.com/elastic/elastic-agent/issues/2887 is related -- a possible Agent-side solution is to implement better policy parsing to validate the full configuration and give variable providers like Kubernetes a list of the variables that are used.
@bturquet / @gizas, if we add hooks to the variable provider API for the Coordinator to give a list of possible variables, what work would be needed to restrict Kubernetes queries to those variables?
@faec I'm trying to understand how we can combine those pieces. Let's say the parsing changes and there is a list of variables that the provider will need to populate. In the Kubernetes provider here we start the watchers, but with general arguments.
The other metadata enrichment we do with enrichers is, again, unrelated to the flow you describe here.
Maybe we can sync offline for me to understand more about this?
cc @MichaelKatsoulis
Hi all,
I was looking at this, and I wanted to know if we are applying any filtering on the data we receive from the k8s metadata. Does all data need to be cached locally in the local k8s cache? I'd like to know if we can apply any transformation to nullify some of the fields and keep only the ones we care about; this way, the RSS memory of the Elastic Agent will hold only the data we care about and will not be influenced by the size of the k8s cluster.
Hello all @faec @ycombinator, is there any update on this issue? I am planning an upgrade to 8.14.1 this week; do we anticipate any improvements?
Fae is currently on PTO and unfortunately she hasn't had time to investigate this yet. This is planned for the current sprint (which started today).
We are facing this issue too. We see the elastic agents hitting the current memory limit of 1200Mi. I would greatly appreciate it if this topic could be given higher priority, as it is quite annoying to see the agents using that much memory.
Hello @rgarcia89, this topic has a high priority but, as you can imagine, it is not the only one. @faec will start looking at this soon, so I hope we will be able to share good news shortly.
FWIW, I think we could apply some meaningful transformers in the informers. We did something very similar in our mki-cost-exporter project: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/costmeter/meter.go#L124C14-L124C26 here is an example of the cache.TransformFunc which we set to our informers: https://github.com/nimdanitro/mki-cost-exporter/blob/feat/poc/pkg/informers/transformer/transform.go#L34
Obviously, we could ignore a large portion of the information for our specific use case.
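To make the transformer idea concrete, here is a minimal, language-neutral sketch written in Python. In client-go the real hook is a `cache.TransformFunc` (a function that receives each object and returns the object to store) set on the informer. The `strip_pod` function and the choice of fields to keep are hypothetical, and the fields actually needed would depend on what the integrations reference.

```python
def strip_pod(pod):
    """Return a reduced copy of a Pod-like dict, keeping only selected fields.

    The kept fields are an illustrative assumption, not the agent's real list.
    """
    meta = pod.get("metadata", {})
    spec = pod.get("spec", {})
    return {
        "metadata": {
            "name": meta.get("name"),
            "namespace": meta.get("namespace"),
            "uid": meta.get("uid"),
            "labels": dict(meta.get("labels", {})),
        },
        "spec": {"nodeName": spec.get("nodeName")},
    }


# A made-up Pod object with bulky fields that would otherwise sit in the cache.
example_pod = {
    "metadata": {
        "name": "web-1",
        "namespace": "default",
        "uid": "1234",
        "labels": {"app": "web"},
        "managedFields": [{"manager": "kubectl", "fieldsV1": {"f:spec": {}}}],
    },
    "spec": {"nodeName": "node-a", "containers": [{"name": "web", "env": []}]},
    "status": {"phase": "Running", "conditions": []},
}

reduced = strip_pod(example_pod)
print(sorted(reduced["metadata"]))  # ['labels', 'name', 'namespace', 'uid']
```

Because the transform runs before the object is stored, the cache would hold only the reduced form, so its size would depend on the number of objects rather than on how verbose each one is.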
Hey Team, any update on this issue? It has been acknowledged as a high priority, yet there have been no updates for months, which is very worrisome.
Can we please prioritize this? We need to get the agent footprint down as much as possible, since provisioning 4 GB of memory for it would reduce the RAM available for customer workloads.
It's still prioritized. Unfortunately there were other more urgent matters that we're still wrapping up.
@faec any updates from your part?
@faec There is one issue that I filed a while ago that I think would help reduce memory usage in the case where a specific provider is not even being used: https://github.com/elastic/elastic-agent/issues/3609. With that change, a provider would have no reason to run at all unless the policy references it.
Using the same logic, it could build on your idea of recording exactly which variables are referenced from the policy. The variable storage system used by the composable module could then use this information to store only what is needed, without having to change the providers at all (it could just drop the fields that are not needed).
The problem is the case where a policy starts referencing a new variable whose data has already been dropped, even though the provider had supplied it earlier. This is where I do believe the providers will need to be adjusted so that they are given the list of variables referenced in the policy. That will allow them to do only the minimal work required, and to notice when a new variable is added so that they can push an update to the variable storage and make that variable's information available.
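The flow described above could be sketched roughly as follows. This is an illustration only, not Agent's actual policy parser: it assumes variables appear as `${provider.key}` references in the policy text, uses a deliberately simple regular expression that does not handle fallbacks or quoting, and all function names are hypothetical.

```python
import re

# Simplified pattern for ${...} references; real policy parsing is more involved.
VAR_PATTERN = re.compile(r"\$\{([^}]+)\}")


def referenced_variables(policy_text):
    """Collect the variable names referenced anywhere in the policy text."""
    return {m.group(1).strip() for m in VAR_PATTERN.finditer(policy_text)}


def filter_mapping(provider_data, referenced):
    """Keep only the provider fields that the policy actually references."""
    return {k: v for k, v in provider_data.items() if k in referenced}


def newly_referenced(old_refs, new_refs):
    """Variables the provider must start supplying after a policy change."""
    return new_refs - old_refs


policy_v1 = "id: ${kubernetes.pod.name}\nnamespace: ${kubernetes.namespace}"
policy_v2 = policy_v1 + "\nhost: ${kubernetes.pod.ip}"

provider_data = {
    "kubernetes.pod.name": "web-1",
    "kubernetes.namespace": "default",
    "kubernetes.pod.ip": "10.0.0.5",
}

refs_v1 = referenced_variables(policy_v1)
stored = filter_mapping(provider_data, refs_v1)
added = newly_referenced(refs_v1, referenced_variables(policy_v2))
print(sorted(stored), sorted(added))
```

Here `stored` omits `kubernetes.pod.ip` under the first policy, and `added` reports it once the second policy references it, which is the point at which the provider would need to push that data again.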
I’m running into some memory issues with Elastic Agent 8.15. It’s running on Kubernetes, and we limit the memory to 700Mi in the manifest file in Kibana. However, when enabling the system metrics + Kubernetes integration, the process keeps crashing and I get almost no data in. When I raise the limit to 800Mi, it runs stable. This seems related to this GH issue.
Here are my test results:
Elastic Agent 8.15.0 (only system metrics integration), limit 700Mi:
NAME CPU(cores) MEMORY(bytes)
elastic-agent-hkfsw 21m 442Mi
Elastic Agent 8.15.0 (system metrics integration + K8s integration), limit 700Mi: -> keeps crashing, no data
NAME CPU(cores) MEMORY(bytes)
elastic-agent-hkfsw 236m 699Mi
Elastic Agent 8.15.0 (system metrics integration + K8s integration), limit 800Mi: -> runs stable
NAME CPU(cores) MEMORY(bytes)
elastic-agent-dbzzm 52m 703Mi
This setup is being used for (marketing) workshops and it's not a great look to ask attendees to increase the memory limit when the Elastic Agent only uses 2 integrations.
We ran some scaling tests in the past that propose a resource configuration (based on 8.7), which can serve as a reference point for comparison.
At the moment the focus of @elastic/obs-ds-hosted-services is the OTel-native Kubernetes collection of logs and metrics, and we have no plans to run any scaling tests for Elastic Agent + integrations (cc @mlunadia) in the current iteration.
We can wait and see the OTel Elastic Agent memory consumption with the latest config, and also check the current resourcing of Elastic Agent with the system + K8s integrations.
This issue occurs even with very small workloads, so it's not really about scale testing.
This is reproducible on a single node k8s cluster, with 26 total pods running
Posting the results of my initial investigation. For now, I'm inclined to agree with Michael's conclusion in https://github.com/elastic/sdh-beats/issues/5148#issuecomment-2352771442 that there isn't a regression here. Still, the increase in memory usage from adding more Pods to the Node seems excessive, but it's not clear where it's coming from.
I would also like to post some results here, based on Luca's comment about the OOM in small workloads. I ran some tests on multiple versions of Elastic Agent and I want to share the results.
I used a single node cluster in GKE with 38 pods running. Here are the results of Elastic Agent's memory consumption per version:
Integration | Memory Consumption |
---|---|
no integration | 280-330 Mb |
system | 450-500 Mb |
Kubernetes | 550-600 Mb |
Kubernetes & system | 740-790 Mb (restarts) |
Integration | Memory Consumption |
---|---|
no integration | 260-290 Mb |
system | 410-430 Mb |
Kubernetes | 550-570 Mb |
Kubernetes & system | 700-730 Mb |
Integration | Memory Consumption |
---|---|
no integration | 200-210 Mb |
system | 320-330 Mb |
Kubernetes | 500-510 Mb |
Kubernetes & system | 630-650 Mb |
Integration | Memory Consumption |
---|---|
no integration | 180-185 Mb |
system | 300-330 Mb |
Kubernetes | 480-520 Mb |
Kubernetes & system | 630-680 Mb |
Integration | Memory Consumption |
---|---|
no integration | 169-190 Mb |
system | 300-310 Mb |
Kubernetes | 520-550 Mb |
Kubernetes & system | 660-720 Mb (restart) |
The easy thing to notice here is that the increase in memory that the Kubernetes integration causes in Elastic Agent is almost constant across versions, at around 300-350 Mb. It actually got better from 8.14.0 onwards, after some better handling of metadata enrichment. The memory consumption of Elastic Agent with no integration at all increased over the version bumps, and with both the Kubernetes and System (installed by default) integrations it reached the set limit of 700 Mb. I don't know if the 300 Mb that the Kubernetes integration adds is a lot or not. But considering that the System integration, which does far less (no constant API calls to k8s), adds around 150 Mb, I could argue that it is reasonable.
Another thing to note is that even without the Kubernetes integration installed, the Kubernetes provider and the add_kubernetes_metadata processor are still enabled by default. I took a look at the heap.pprof of such an agent, and Kubernetes-related functions seem to account for around 10% of it.
I would like to understand @faec's comment better.
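As a rough cross-check of the per-version delta described above, the sketch below takes the "no integration" and "Kubernetes" ranges from the five tables, in the order they appear, and subtracts their midpoints. Using midpoints is an assumption made here for illustration; the original measurements are ranges, and the tables carry no version labels in this copy of the thread.

```python
# (low, high) memory ranges copied from the five tables above, in order.
# Units are whatever the tables use ("Mb" as written by the commenter).
tables = [
    {"none": (280, 330), "kubernetes": (550, 600)},
    {"none": (260, 290), "kubernetes": (550, 570)},
    {"none": (200, 210), "kubernetes": (500, 510)},
    {"none": (180, 185), "kubernetes": (480, 520)},
    {"none": (169, 190), "kubernetes": (520, 550)},
]


def midpoint(value_range):
    """Midpoint of a (low, high) range; an illustrative simplification."""
    low, high = value_range
    return (low + high) / 2


# Extra memory attributable to the Kubernetes integration in each table.
deltas = [midpoint(t["kubernetes"]) - midpoint(t["none"]) for t in tables]
print(deltas)
```

With these midpoints the deltas run from 270 to about 355, which is in the same ballpark as the 300-350 figure quoted in the comment.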
Within the elastic-agent process, more than 90% of memory use is in Kubernetes helpers
How was this measured? With or without Kubernetes Integration? Which version?
@MichaelKatsoulis is this with agent monitoring enabled? I got the container memory usage to ~50Mi after disabling that, with only the elastic-agent binary running in the container. But this still increased to ~90 Mi after starting more Pods.
Yes it is enabled. I kept all the defaults. If disabled, memory consumption with just the binary running is around what you mentioned.
The memory consumption of Elastic Agent with no integration at all increased over the version bumps
The jump in 8.14.0 is because of agentbeat, see https://github.com/elastic/elastic-agent/issues/4730
elastic-agent pod is using 4GB ram. Pods on that host: https://gist.github.com/henrikno/27c4165cd7eec7b3a24c424d8a8dad23, ps aux: https://gist.github.com/henrikno/92634f31dd8a3795ff1ec81b34dc1bf8, elastic-agent using 2.2GB, largest metricbeat (kubernetes-metrics) 1.6GB.
It sounds a bit similar to https://github.com/topfreegames/maestro/pull/473, where the updates from k8s come in faster than they are processed, so they get buffered somewhere in memory.
Looking at the profile supplied by @henrikno, this anomalous memory consumption is caused by storing ReplicaSet data. @neiljbrookes confirmed on Slack that the K8s clusters in question have a lot of Deployments, and consequently ReplicaSets. For example, we have ~7000 Deployments and ~75000 ReplicaSets in a particularly troublesome cluster. The heap profile shows ~700 MB of steady-state memory usage, which comes out to around 10 KB per ReplicaSet, which is a reasonable value.
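The per-ReplicaSet figure can be checked with a quick calculation. This sketch uses the approximate numbers quoted in the comment (~700 MB, ~75000 ReplicaSets) and assumes decimal megabytes; it also assumes the whole steady-state heap is attributable to ReplicaSets, which is a simplification.

```python
heap_bytes = 700 * 1000**2   # ~700 MB steady-state heap (decimal MB assumed)
replicasets = 75_000         # ~75000 ReplicaSets in the cluster

# Average heap cost per cached ReplicaSet, in KB.
per_replicaset_kb = heap_bytes / replicasets / 1000
print(round(per_replicaset_kb, 1))  # 9.3
```

That is roughly 9.3 KB each, consistent with the "around 10 KB" estimate.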
The Agents going OOM was mitigated by setting GOGC to 25, which suggests that churn from excessive updates from the API Server is part of the problem as well.
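To illustrate why a lower GOGC helps here: the Go GC guide describes the target heap size as roughly the live heap plus (live heap + GC roots) × GOGC / 100. The sketch below ignores GC roots and plugs in the ~700 MB live heap from the profile discussed above, purely as an illustration. A real Agent pod runs several processes, so these are not predictions of pod memory usage.

```python
def heap_target_mb(live_heap_mb, gogc):
    """Approximate Go heap target, ignoring GC roots (a simplification)."""
    return live_heap_mb + live_heap_mb * gogc / 100


live_heap_mb = 700  # steady-state live heap from the profile, in MB

default_target = heap_target_mb(live_heap_mb, 100)  # GOGC=100 is Go's default
reduced_target = heap_target_mb(live_heap_mb, 25)   # the mitigation used above

print(default_target, reduced_target)  # 1400.0 875.0
```

Under these assumptions the heap may grow to about twice the live data before a collection at the default setting, but only to about 1.25 times at GOGC=25. The trade-off is more frequent GC cycles and therefore more CPU time.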
I'm planning to submit a fix that will cause us to store only the necessary data shortly. Stopping the churn is going to be a bit more challenging, but I think we should be able to solve it by only subscribing to metadata changes from these ReplicaSets. This will be more challenging to integrate into our autodiscovery framework, but is also less urgent.
Worth noting that I don't believe this is the problem causing unexpected agent memory consumption on Nodes with a lot of Pods, even in small clusters.
Diagnostics from production Agents running on Kubernetes show:
elastic-agent-autodiscover and the other 20% is from helpers internal to elastic-agent.
We need to understand why the Kubernetes helpers are using so much memory, and find a way to mitigate it.
Definition of done