elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Meta]Investigate resource consumption of Elastic Agent with K8s Integration #3801

Open gizas opened 10 months ago

gizas commented 10 months ago

Background

The latest issues, like 3863, 3991 and 4081, proved that installing the default configuration of Elastic Agent with our Kubernetes Integration can leave our customers in unfortunate circumstances (sometimes even with broken k8s clusters). There are many details and variables that affect the final setup and installation of our observability solution, and we can try to summarise and list them here.

Goals

This issue tries to summarise the next actions we need in order to investigate:

Actions

Current Actions

We have observed until now that:

a) Memory consumption of Elastic Agent increased from version 8.8 to 8.9 and later (relevant: https://github.com/elastic/sdh-beats/issues/3863#issuecomment-1733750863)
b) The number of API calls towards the Kubernetes Control API has increased since version 8.9 (see Salesforce 01507229 regarding Elastic Agent overloading the Kubernetes API server: https://github.com/elastic/sdh-beats/issues/3991#issuecomment-1787648161)
c) CPU consumption (although not such a big issue at the moment and not first priority) has been raised here as a concern.

Until now:

Next Planned Actions

Future Plans/Actions

axw commented 10 months ago

Run tests in real k8s clusters and retrieve diagnostics from Agent trying to investigate memory consumption

Once we've resolved the issues (or earlier, if resolving them is not straightforward and we need to iterate): I think we should also figure out how to reliably reproduce the issues in an ephemeral cluster, ideally with some automation in place to create the cluster and whatever workload is necessary to trigger the issues (e.g. create a bunch of deployments/pods/whatever).
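
As one possible starting point for that automation (assuming kind is acceptable as the ephemeral cluster; the node roles and count below are an illustrative assumption, not a tested recommendation), a minimal cluster config could look like:

```yaml
# Sketch of an ephemeral kind cluster config for reproducing the issues.
# Created with: kind create cluster --config kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
```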

Then we can:

gizas commented 10 months ago

Thanks @axw, I have updated the Next actions section a bit and added some previous ideas/issues that we can investigate here.

lucabelluccini commented 8 months ago

As a short-term measure, can we somehow document the known issues/limitations we are facing until now?

dimm0 commented 3 months ago

Is there progress in the latest version, or is it still destroying the k8s master? I disabled Elastic in our cluster a while ago and am checking if there's any progress so far. I can't really tell whether it should have improved if I upgrade.

cmacknz commented 3 months ago

We have tracked down the source of the high memory usage on k8s and are working to fix it. https://github.com/elastic/elastic-agent/issues/4729 is the tracking issue.

dimm0 commented 3 months ago

And what about rate-limiting the k8s apiserver requests? Is any work going on that?

gizas commented 3 months ago

what about rate-limiting the k8s apiserver requests

Regarding rate limiting, the main issue is this one, which is not yet prioritised for the next iterations, but it is definitely in our backlog.

Somewhat related: we have already merged 3625, in order to minimise any possible effect of leader election API calls. Additionally, since 8.14.0 we have done a major refactoring in 37243, which we have shown helps overall resource consumption.
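
Until rate limiting lands, one way to quantify the pressure on the API server is the built-in `apiserver_request_total` counter, which `kubectl get --raw /metrics` exposes. A minimal sketch of aggregating those counters per resource; the metric lines below are illustrative sample data, not real cluster output:

```shell
# Aggregate apiserver_request_total counters per resource with awk.
# The sample lines here are made up for illustration; on a live cluster
# you would feed in `kubectl get --raw /metrics` instead of $metrics.
metrics='apiserver_request_total{resource="pods",verb="LIST"} 120
apiserver_request_total{resource="pods",verb="WATCH"} 30
apiserver_request_total{resource="nodes",verb="LIST"} 15'

per_resource=$(echo "$metrics" | awk '
  match($0, /resource="[^"]*"/) {
    r = substr($0, RSTART + 10, RLENGTH - 11)  # strip resource=" prefix and closing quote
    sum[r] += $NF                              # counter value is the last field
  }
  END { for (r in sum) print r, sum[r] }' | sort)

echo "$per_resource"
```

This only shows totals per resource; attributing calls specifically to Elastic Agent would need API server audit logs.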

constanca-m commented 1 week ago

Test setup

I have run a script to evaluate the performance of our K8s integration. I evaluated all 8.x.0 versions between 8.5.0 and 8.15.0.

The test increases the number of pods in a one-node cluster in these steps: 12, 61, 111, 161, 211, 311, 411, and 511.

I recorded the following results after 5 min for each cycle:

Once the EA restarts, I stop recording results for the subsequent pod increases, since performance is no longer stable.

This is the script I am running for the tests.

```shell
setup_cluster () {
  kind delete cluster
  kind create cluster

  # This is so we can execute kubectl top
  kubectl apply -f https://raw.githubusercontent.com/pythianarora/total-practice/master/sample-kubernetes-code/metrics-server.yaml
}

test_n_pods () {
  # $1 - EA filename to use in kubectl apply
  # $2 - filename for the results

  # Prepare cluster with EA using kubernetes + system policy
  setup_cluster
  kubectl apply -f "$1"

  echo "| Pods | CPU | Memory | EA pod restarts |" > "$2"
  echo "|------|-----|--------|-----------------|" >> "$2"

  for replicas in 1 50 100 150 200 300 400 500 ; do
    kubectl delete -f nginx-pod.yaml --ignore-not-found
    sed -i -e "s/ replicas: .*/ replicas: $replicas/g" nginx-pod.yaml
    kubectl apply -f nginx-pod.yaml

    sleep 5m

    # Quote the pattern so the shell does not glob-expand it
    top=$(kubectl top pods -n kube-system | grep 'elastic')
    pods=$(kubectl get pods --no-headers --all-namespaces | wc -l)
    line=$(kubectl get pods -o wide --all-namespaces | awk '$2 ~ /^elastic/')
    restarts=$(echo "$line" | awk '{print $5}')

    print_results_to_file "$pods" "$top" "$restarts" "$2"
  done
}

print_results_to_file () {
  # Gets arguments:
  # $1 = number of pods
  # $2 = kubectl top result
  # $3 = number of EA restarts
  # $4 = results filename

  # Parse result of kubectl top (example 'elastic-agent-985zk 16m 583Mi')
  cpu=$(echo "$2" | awk '{print $2}')
  memory=$(echo "$2" | awk '{print $3}')

  echo "| $1 | $cpu | $memory | $3 |" >> "$4"
}

# Test the performance by running test_n_pods. Change the arguments to your own.
test_n_pods
```
This is the NGINX pod deployment I use in the script.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 500
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
```
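
As a quick sanity check of the `kubectl top` parsing in `print_results_to_file`, the awk extraction can be exercised without a cluster on a sample line (the pod name and values below are just the example from the script's own comment):

```shell
# Sample line in the format `kubectl top pods` prints for the agent pod
top='elastic-agent-985zk 16m 583Mi'
cpu=$(echo "$top" | awk '{print $2}')
memory=$(echo "$top" | awk '{print $3}')
row="| 12 | $cpu | $memory | 0 |"
echo "$row"
# prints: | 12 | 16m | 583Mi | 0 |
```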

8.5

Using the default configuration from the agent:

resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 35m | 281Mi | 0 |
| 61 | 115m | 410Mi | 0 |
| 111 | 272m | 399Mi | 0 |
| 161 | 852m | 491Mi | 0 |
| 211 | 923m | 441Mi | 0 |
| 311 | 770m | 445Mi | 0 |
| 411 | 625m | 450Mi | 0 |
| 511 | 342m | 414Mi | 0 |

8.6

Using the default configuration from the agent:

resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 33m | 407Mi | 0 |
| 61 | n/a | n/a | 4 |

No longer works from 61 pods up.

8.7

Using the default configuration from the agent:

resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 32m | 431Mi | 0 |
| 61 | n/a | n/a | 4 |

No longer works from 61 pods up.

8.8 - default agent configuration changes

Using the default configuration from the agent:

resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 24m | 378Mi | 0 |
| 61 | 94m | 489Mi | 0 |
| 111 | 298m | 596Mi | 0 |
| 161 | n/a | n/a | 1 |

No longer works from 161 pods up.

8.9

Using the default configuration from the agent:

resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 32m | 421Mi | 0 |
| 61 | 92m | 533Mi | 0 |
| 111 | 250m | 639Mi | 0 |
| 161 | n/a | n/a | 1 |

No longer works from 161 pods up.

8.10

Using the default configuration from the agent:

resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 25m | 424Mi | 0 |
| 61 | 90m | 543Mi | 0 |
| 111 | n/a | n/a | 2 |

No longer works from 111 pods up.

8.11

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 14m | 435Mi | 0 |
| 61 | 54m | 577Mi | 0 |
| 111 | n/a | n/a | 2 |

No longer works from 111 pods up.

8.12

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 15m | 445Mi | 0 |
| 61 | 54m | 604Mi | 0 |
| 111 | n/a | n/a | 2 |

No longer works from 111 pods up.

8.13

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 14m | 441Mi | 0 |
| 61 | 51m | 538Mi | 0 |
| 111 | n/a | n/a | 2 |

No longer works from 111 pods up.

8.14

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 13m | 580Mi | 0 |
| 61 | n/a | n/a | 1 |

No longer works from 61 pods up.

8.15

Using the default configuration from the agent:

resources:
    limits:
        memory: 700Mi
    requests:
        cpu: 100m
        memory: 400Mi

Results:

| Pods | CPU | Memory | EA pod restarts |
|------|-----|--------|-----------------|
| 12 | 28m | 595Mi | 0 |
| 61 | n/a | n/a | 1 |

No longer works from 61 pods up.


Notes

From 8.5 to 8.6 version, something changed that caused a huge memory increase in the Kubernetes integration, to the point that increasing the number of pods made the agent stop and restart over and over again.

From version 8.8, the number of pods that made the agent stop increased. This is a good sign, but notice that the default memory limits and requests also increased, which surely helps explain the seemingly better performance.

From 8.9 to 8.10 version, the number of pods that caused the EA to stop and restart decreased again. Something happened again in the Kubernetes integration that affected the agent performance.

From 8.13 to 8.14, the number of pods that caused the EA to stop and restart decreased again. Something happened once more in the Kubernetes integration that affected agent performance. Also, from @gizas: 8.13 vs 8.14 is a 140Mi difference even with only 12 pods.

It seems Kubernetes integration memory usage has been getting higher since 8.5, with notable increases in 8.6, 8.10 and 8.14 (disregarding the increase of the default EA memory resources in 8.8, which helped hide possible issues in the Kubernetes integration).

EvelienSchellekens commented 1 week ago

@constanca-m Have you also tested if the data is actually sent to Elastic? My setup had ~15 pods with 8.15 and the memory ran high, and even though the pod itself didn't restart, the K8s data didn't come in (or was very spotty). I think one of the processes itself was crashing.

constanca-m commented 1 week ago

Have you also tested if the data is actually sent to Elastic?

In my case, I can see data in Discover (I am filtering by kubernetes.container.name):

Image

I did not analyze the logs to know if everything is being sent there, or whether we are losing data. These are the logs from running all the tests in 8.15, including the pod restarts. @EvelienSchellekens

gizas commented 1 week ago

Really useful @constanca-m !

Adding some notes here:

A general comment is that all the above tests just measure memory consumption under the same k8s load. Identifying a memory leak requires watching the memory trend over time. An increase by itself is not necessarily bad or good if we are observing more k8s resources.
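
To watch the trend rather than a point-in-time value, a small sampler could append timestamped readings to a CSV; this is only a sketch, where the command to sample (e.g. `kubectl top pods -n kube-system`) is passed in by the caller:

```shell
# sample_metric N INTERVAL OUTFILE CMD...
# Runs CMD every INTERVAL seconds, N times, appending "epoch,output"
# lines to OUTFILE for later plotting of the trend.
sample_metric () {
  n=$1; interval=$2; outfile=$3; shift 3
  i=0
  while [ "$i" -lt "$n" ]; do
    printf '%s,%s\n' "$(date +%s)" "$("$@")" >> "$outfile"
    sleep "$interval"
    i=$((i + 1))
  done
}

# In a real run this would be something like:
#   sample_metric 60 60 mem-trend.csv kubectl top pods -n kube-system
# Stub demonstration with echo instead of kubectl:
trend_file=$(mktemp)
sample_metric 3 0 "$trend_file" echo '583Mi'
```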

Additionally:

constanca-m commented 1 week ago

Thank you @gizas.

I think this issue and the scripts to run these tests should be placed somewhere more accessible to the team. Maybe in the future repository you mentioned in Thursday's meeting to help with identifying issues.

for eg in 8.12 I see 61 pods and memory 604Mi but 0 restarts. How is it possible with limit 500 to have more memory?

It says the limits memory is 700Mi for 8.12.

8.13 vs 8.14 is 140Mi diff even with no of 12 pods ?

It looks like it... This was just one test, and values always vary a bit between tests. We could run a test with smaller increases in pods to better capture the differences between these latest versions.

Edit: but since 8.15 has more or less the same values as 8.14, I believe we do have a significant difference between 8.13 and 8.14, like you pointed out. Thanks, I will include it in the notes of the original comment as well!

Does above tests include system integration?

Yes. You are correct, we don't include tests running with just the System integration, unfortunately. I agree it would be good to also have an idea of that, but I don't believe the System integration is causing any issues here.

In all above tests we dont produce logs right?

This is the hard part! With the agent stopping and restarting over and over again, it is very hard: downloading the diagnostics gets stuck in a loop, and the zip never gets ready. Not sure what is going on there, but I have not paid much attention to it.

Since https://github.com/elastic/elastic-agent/pull/3593 we disable the deployment and cronjob metadata enrichment. Maybe the cluster did not have any cronjobs and the deployments were too few to see any improvement. Since https://github.com/elastic/beats/pull/35483 we have introduced replicaset and job metadata generation. As above, this could explain some increase.

Correct. Only the default pods, EA, metrics server and the NGINX pod.


I believe the best approach would be to look at the changelog and see what big changes we had. I remember the watchers issue, but since that PR has the memory tests there, I don't believe it could have influenced the degraded performance, though I could of course be wrong (and biased 😄).

gizas commented 1 week ago

@constanca-m the https://github.com/elastic/k8s-integration-infra?tab=readme-ov-file#put-load-on-the-cluster script I mentioned in the call (public repo).

constanca-m commented 1 week ago

I used a different one @gizas, it is local and more simplified (in the comment with the test results). I think it should be enough for these tests, and that script for more complex ones.

MichaelKatsoulis commented 1 week ago

I also performed some scale tests. I created a one-node cluster in GKE with ~95 pods running. I tested versions 8.13.0 and 8.14.0 with and without kube-state-metrics to simulate leader and non-leader node scenarios.

TBH the 700Mi memory limit suffices in both versions. Only when kube-state-metrics is enabled did I get one restart, which means that in big clusters (note that in Kubernetes 110 pods per node is the limit) the memory limit needs some adjustment. Versions 8.13.0 and 8.14.0 do not seem to have big differences: for 8.13.0 the agent pod's memory was around 600Mi, while for 8.14.0 it was around 640Mi. In all cases I used nginx pods, but there were really many logs generated.

I don't know why @constanca-m got different results.