elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Elastic agent uses too much memory per Pod in k8s #5835

Open swiatekm opened 1 week ago

swiatekm commented 1 week ago

In its default configuration, agent has the kubernetes provider enabled. In DaemonSet mode, this provider keeps track of data about Pods scheduled on the Node the agent is running on. This issue concerns the fact that the agent process itself uses an excessive amount of memory if the number of these Pods is high (for the purposes of this issue, "high" means close to the default Kubernetes limit of 110 Pods per Node). This was originally discovered while troubleshooting #4729.

This effect is visible even if we disable all inputs and self-monitoring, leaving the agent to run as a single process without any components. This strongly implies it has to do with configuration variable providers. I used this empty configuration in my testing to rule out confounding effects from beats, but the effect is more pronounced when components that use variables are present in the configuration.

Here's a graph of agent memory consumption as the number of Pods on the Node increases from 10 to 110:

[graph: agent memory usage vs. Pod count]

A couple of observations from looking at how configuration changes affect this behaviour:

Test setup

More data

I resized my Deployment a couple times and looked at a heap profile of the agent process:

[heap profile of the agent process]

The churn appears to be coming primarily from needing to recreate all the variables whenever a Pod is updated. The call to composable.cloneMap is where we copy data from the host provider.

Root cause

The root cause appears to be a combination of behaviours in the variable pipeline:

  1. All variables are recreated and emitted whenever there's any change to the underlying data. In Kubernetes, with a lot of Pods on a Node, changes can be quite frequent, and the amount of data is non-trivial.
  2. We copy all the data from all context providers into every dynamic provider mapping. If there are a lot of dynamic provider mappings (one for each Pod), this can be quite expensive (see the sketch below).
  3. I suspect there's also more copying going on in component model generation, but I haven't looked into it too much.
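To make the cost described in points 1 and 2 concrete, here's a minimal, self-contained sketch in Go (deepCopy, contextData and mappings are illustrative names, not the actual composable package types): a single Pod event causes every per-Pod mapping to be rebuilt, each with its own deep copy of the shared context-provider data, so the work per event scales with the number of Pods times the size of the context data.

package main

import "fmt"

// deepCopy recursively clones a map, analogous to copying context provider
// data into each dynamic mapping.
func deepCopy(in map[string]interface{}) map[string]interface{} {
    out := make(map[string]interface{}, len(in))
    for k, v := range in {
        if m, ok := v.(map[string]interface{}); ok {
            out[k] = deepCopy(m)
        } else {
            out[k] = v
        }
    }
    return out
}

func main() {
    // Shared data from context providers (host, env, ...).
    contextData := map[string]interface{}{
        "host": map[string]interface{}{"name": "node-1", "ip": "10.0.0.1"},
    }

    // One dynamic mapping per Pod on the Node; a single Pod update causes
    // all of them to be rebuilt, each with its own copy of the context data.
    pods := 110
    mappings := make([]map[string]interface{}, 0, pods)
    for i := 0; i < pods; i++ {
        m := deepCopy(contextData)
        m["kubernetes"] = map[string]interface{}{
            "pod": map[string]interface{}{"name": fmt.Sprintf("pod-%d", i)},
        }
        mappings = append(mappings, m)
    }
    fmt.Println("rebuilt", len(mappings), "mappings for a single Pod event")
}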
elasticmachine commented 1 week ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

swiatekm commented 1 week ago

Here are some potential fixes I thought of. I'm not sure how feasible they are given the provider architecture, but Vars themselves seem flexible enough to permit them:

  1. Use a reference for the context provider mapping instead of copying it. If we can guarantee the receiver won't selectively modify it, this should work (see the sketch after this list).
  2. Store vars instead of mappings in the dynamic provider state. Then, just emit the stored value when needed. Also only viable if we know this won't be modified. This would prevent us from regenerating everything on each change.
  3. dynamicProviderState.AddOrUpdate could similarly avoid some copies with additional assumptions about provider semantics.
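Here's a rough sketch of what fixes 1 and 2 could look like, assuming all consumers treat the data as read-only (dynamicState and its methods are hypothetical stand-ins, not the real composable API): the dynamic provider state keeps a reference to the shared context mapping and caches the composed vars, invalidating the cache only when its own data changes, instead of deep-copying everything on every emit.

package example

// dynamicState is an illustrative stand-in for per-mapping dynamic provider state.
type dynamicState struct {
    contextData map[string]interface{} // shared by reference, treated as read-only
    podData     map[string]interface{} // owned by this mapping
    cached      map[string]interface{} // composed vars, rebuilt lazily
}

// AddOrUpdate stores new Pod data; the caller must not mutate it afterwards.
func (s *dynamicState) AddOrUpdate(podData map[string]interface{}) {
    s.podData = podData
    s.cached = nil // invalidate; recompose lazily on the next emit
}

// Vars returns the composed mapping without deep-copying the shared context data.
func (s *dynamicState) Vars() map[string]interface{} {
    if s.cached == nil {
        s.cached = map[string]interface{}{
            "host":       s.contextData["host"], // reference, not a clone
            "kubernetes": s.podData,
        }
    }
    return s.cached
}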
blakerouse commented 1 week ago

@swiatekm You can assume that the data passed into the provider is not modified. The reason this code copies is to ensure that the provider doesn't update the structure after sending it to AddOrUpdate. It would be totally acceptable to stop doing this to lower memory usage, but providers then need to be sure they don't mutate the structure after calling AddOrUpdate (unless they call AddOrUpdate again after the mutation).
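Put as a sketch, the contract would look roughly like this (the comm interface and onPodUpdate function are simplified illustrations, not the real provider API): the provider builds a fresh mapping for every update and never touches it again once it has been handed over.

package example

// comm is a simplified stand-in for the controller side that receives provider updates.
type comm interface {
    AddOrUpdate(id string, mapping map[string]interface{}) error
}

// onPodUpdate builds a fresh mapping for every update. After the call, the
// controller may keep a reference to the mapping without cloning it, so the
// provider must not modify it again; if the Pod changes, it builds a new
// mapping and calls AddOrUpdate once more.
func onPodUpdate(c comm, uid, podName string) error {
    mapping := map[string]interface{}{
        "kubernetes": map[string]interface{}{
            "pod": map[string]interface{}{"name": podName},
        },
    }
    return c.AddOrUpdate(uid, mapping)
}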

faec commented 1 week ago

One aspect of this issue is that the Agent Coordinator doesn't tell variable providers which variables it actually wants, which means providers need to collect all supported data even if it's known deterministically that it will never be needed. There's an issue already involving preprocessing variable substitutions in new policies, and a natural feature to include with it is to give providers an explicit list of variables that need to be monitored. In that case, an empty policy that doesn't actually use the kubernetes metadata could avoid querying it in the first place.

swiatekm commented 1 week ago

> One aspect of this issue is that the Agent Coordinator doesn't tell variable providers which variables it actually wants, which means providers need to collect all supported data even if it's known deterministically that it will never be needed. There's an issue already involving preprocessing variable substitutions in new policies, and a natural feature to include with it is to give providers an explicit list of variables that need to be monitored. In that case, an empty policy that doesn't actually use the kubernetes metadata could avoid querying it in the first place.

I think most of this data is actually used by filebeat. See an example filebeat input:

data_stream:
  dataset: kubernetes.container_logs
id: kubernetes-container-logs-nginx-d556bf558-vcdpl-4bd5a70737faebfa2fbd4d34b9003cf8f32cd086787133590342e24d331da995
index: logs-kubernetes.container_logs-default
parsers:
  - container:
      format: auto
      stream: all
paths:
  - /var/log/containers/*4bd5a70737faebfa2fbd4d34b9003cf8f32cd086787133590342e24d331da995.log
processors:
  - add_fields:
      fields:
        input_id: filestream-container-logs-4b47f8c5-5515-4267-a33d-4fb64806f81c-kubernetes-f483454b-e8f2-42b5-8c22-4da229e86b8a.nginx
      target: '@metadata'
  - add_fields:
      fields:
        dataset: kubernetes.container_logs
        namespace: default
        type: logs
      target: data_stream
  - add_fields:
      fields:
        dataset: kubernetes.container_logs
      target: event
  - add_fields:
      fields:
        stream_id: kubernetes-container-logs-nginx-d556bf558-vcdpl-4bd5a70737faebfa2fbd4d34b9003cf8f32cd086787133590342e24d331da995
      target: '@metadata'
  - add_fields:
      fields:
        id: 5206cacb-562e-46a1-b256-0b6833a0d653
        snapshot: false
        version: 8.15.0
      target: elastic_agent
  - add_fields:
      fields:
        id: 5206cacb-562e-46a1-b256-0b6833a0d653
      target: agent
  - add_fields:
      fields:
        id: 4bd5a70737faebfa2fbd4d34b9003cf8f32cd086787133590342e24d331da995
        image:
          name: nginx:1.14.2
        runtime: containerd
      target: container
  - add_fields:
      fields:
        cluster:
          name: kind
          url: kind-control-plane:6443
      target: orchestrator
  - add_fields:
      fields:
        container:
          name: nginx
        labels:
          app: nginx
          pod-template-hash: d556bf558
        namespace: default
        namespace_labels:
          kubernetes_io/metadata_name: default
        namespace_uid: 9df9c3db-a0ca-426d-bbb5-0c63092a39ae
        node:
          hostname: kind-control-plane
          labels:
            beta_kubernetes_io/arch: amd64
            beta_kubernetes_io/os: linux
            kubernetes_io/arch: amd64
            kubernetes_io/hostname: kind-control-plane
            kubernetes_io/os: linux
            node-role_kubernetes_io/control-plane: ""
          name: kind-control-plane
          uid: 0b6d3cbf-7a86-4775-9351-86f5448c21d8
        pod:
          ip: 10.244.0.102
          name: nginx-d556bf558-vcdpl
          uid: f483454b-e8f2-42b5-8c22-4da229e86b8a
        replicaset:
          name: nginx-d556bf558
      target: kubernetes
  - add_fields:
      fields:
        annotations:
          elastic_co/dataset: ""
          elastic_co/namespace: ""
          elastic_co/preserve_original_event: ""
      target: kubernetes
  - drop_fields:
      fields:
        - kubernetes.annotations.elastic_co/dataset
      ignore_missing: true
      when:
        equals:
          kubernetes:
            annotations:
              elastic_co/dataset: ""
  - drop_fields:
      fields:
        - kubernetes.annotations.elastic_co/namespace
      ignore_missing: true
      when:
        equals:
          kubernetes:
            annotations:
              elastic_co/namespace: ""
  - drop_fields:
      fields:
        - kubernetes.annotations.elastic_co/preserve_original_event
      ignore_missing: true
      when:
        equals:
          kubernetes:
            annotations:
              elastic_co/preserve_original_event: ""
  - add_tags:
      tags:
        - preserve_original_event
      when:
        and:
          - has_fields:
              - kubernetes.annotations.elastic_co/preserve_original_event
          - regexp:
              kubernetes:
                annotations:
                  elastic_co/preserve_original_event: ^(?i)true$
prospector:
  scanner:
    symlinks: true
type: filestream
faec commented 1 week ago

> I think most of this data is actually used by filebeat.

This could be true, but one thing that hasn't been obvious to me from looking at examples is how much unused metadata implicitly comes along with the current approach. E.g. in the example you give, even though it needs different categories of Kubernetes metadata for potentially many containers/pods, all of the actually substituted fields together are still quite small, and would not be a major memory footprint even with hundreds or thousands of pods/containers. (There may be limits on how well we can act on that, though, depending on what's cached by the Kubernetes support libraries themselves.)

Anyway, if at some point there's reason to think that pruning fields would help, it would definitely be a feasible modification for the Agent Coordinator to pass the relevant/minimal field list through to variable providers.

alexsapran commented 1 week ago

@swiatekm, your comment here https://github.com/elastic/elastic-agent/issues/5835#issuecomment-2432790163 seems to be related to https://github.com/elastic/ingest-dev/issues/2454#issuecomment-1737569099, which investigated the impact of map.Clone in the context of the processors. Taking a look at that history would, IMHO, help give some context.

swiatekm commented 1 week ago

> @swiatekm, your comment here #5835 (comment) seems to be related to something else elastic/ingest-dev#2454 (comment) that investigated the impact of map.Clone in the context of the processors. Taking a look at the history would help IMHO give some context.

To be clear, I wasn't making a statement about the performance of filebeat, just that the generated configuration for it appears to use a lot of the Pod metadata. This issue is about the elastic-agent binary exclusively.

blakerouse commented 1 week ago

I think the complexity of only providing the fields that are actually used might outweigh the memory savings, unless we could determine the savings to be very large. The complexity comes when the policy changes and a new field is referenced that the provider is not currently providing, because it was omitted from the previous policy. The new set of fields then needs to be sent to the provider, and the provider needs to update all of its mappings with the new variable. I think that complexity might outweigh the benefit of such a change.

I think the patch of not cloning the mappings might be a large enough win on its own that omitting fields wouldn't be required.

I do think that solving https://github.com/elastic/elastic-agent/issues/3609 would be very nice to have, and it's something that needs to be done for Hybrid OTel mode anyway: when the agent is only running an OTel configuration, it should not be running any providers at all. That would be a memory saving for every agent whose policy doesn't reference anything Kubernetes related.

swiatekm commented 1 week ago

Experimenting with removing var cloning led me to an unusual discovery: on agent start, we spend a lot of memory getting the host IP addresses. See these heap profiles:

[two heap profiles of the agent process]

I don't get why this would be the case. I verified that we only call the host provider function once, and allocating 33 MB just to get IP addresses from Linux seems very excessive, especially since we only use Go stdlib functions to do it. Looking at the interfaces via shell commands on the Node didn't show anything unusual. I'll see if I can run a small program to check whether we're iterating over a bunch of junk data - Kubernetes is known to mess with local networking a lot.

EDIT: This looks like a kind of N+1 problem with syscalls. The code lists the interfaces, then lists the addresses for each interface separately. The second call is not cheap, so if there are a lot of interfaces on the machine - which happens on a K8s Node with a lot of Pods - it adds up to a fair amount of memory allocation. There is actually a Go stdlib function that gets all the addresses in a single syscall; I'll try to use it and see if that helps.
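For illustration, here's a small, self-contained comparison of the two approaches using only the standard library (this is a sketch, not the go-sysinfo code itself): iterating net.Interfaces() and calling Addrs() on each interface triggers a separate netlink address dump per interface, while net.InterfaceAddrs() fetches every address in a single dump.

package main

import (
    "fmt"
    "net"
)

// perInterface mirrors the N+1 pattern: one address lookup per interface.
func perInterface() ([]net.Addr, error) {
    ifaces, err := net.Interfaces()
    if err != nil {
        return nil, err
    }
    var all []net.Addr
    for _, iface := range ifaces {
        addrs, err := iface.Addrs() // separate netlink dump for each interface
        if err != nil {
            return nil, err
        }
        all = append(all, addrs...)
    }
    return all, nil
}

// singleCall fetches all addresses at once.
func singleCall() ([]net.Addr, error) {
    return net.InterfaceAddrs() // one dump covering every interface
}

func main() {
    a, _ := perInterface()
    b, _ := singleCall()
    fmt.Printf("per-interface: %d addrs, single call: %d addrs\n", len(a), len(b))
}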

cmacknz commented 5 days ago

@andrewkroh any quick ideas about how to bring down the memory usage from syscall.NetlinkRIB coming through go-sysinfo's shared.Network() implementation? See the profile above in https://github.com/elastic/elastic-agent/issues/5835#issuecomment-2441544789.

swiatekm commented 5 days ago

@cmacknz https://github.com/elastic/go-sysinfo/pull/246 should already help a lot here.

cmacknz commented 5 days ago

Nice! How much of an improvement was that out of curiosity?

swiatekm commented 5 days ago

> Nice! How much of an improvement was that out of curiosity?

Around 4x less memory in my environment.