gizas opened 10 months ago
Run tests in real k8s clusters and retrieve diagnostics from the Agent to investigate memory consumption
Once we've resolved the issues (or earlier, if resolving them is not straightforward and we need to iterate): I think we should also figure out how to reliably reproduce the issues in an ephemeral cluster, ideally with some automation in place to create the cluster and whatever workload is necessary to trigger the issues (e.g. create a bunch of deployments/pods/whatever).
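For example, a minimal sketch of such automation, assuming `kind` and `kubectl` are installed and that a plain nginx Deployment is a good-enough stand-in workload (the cluster name, Deployment name and replica count below are only illustrative):

```python
import subprocess

def run(cmd):
    """Run a command, echo it, and fail loudly so broken steps are visible."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a throwaway single-node cluster (assumes kind is installed).
run(["kind", "create", "cluster", "--name", "agent-repro"])

# Deploy a trivial workload that can be scaled up later to trigger the issues.
run(["kubectl", "create", "deployment", "workload", "--image=nginx", "--replicas=50"])
run(["kubectl", "rollout", "status", "deployment/workload", "--timeout=300s"])

# Next step (not shown): install Elastic Agent (standalone manifest or Helm),
# then scale the workload further while watching the agent's resource usage.
```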
Then we can:
Thanks @axw, I have updated the Next actions section a bit and added some previous ideas/issues that we can investigate here.
As a short-term measure, can we somehow document the known issues / limitations we have been facing so far?
Is there progress in the latest version, or is it still destroying the k8s master? I disabled Elastic in our cluster a while ago and am checking whether there has been any progress so far. I can't really tell whether it should have improved if I upgrade.
We have tracked down the source of the high memory usage on k8s and are working to fix it. https://github.com/elastic/elastic-agent/issues/4729 is the tracking issue.
And what about rate-limiting the k8s apiserver requests? Is there any work going on for that?
> what about rate-limiting the k8s apiserver requests
Regarding rate limiting, the main issue is this one, which is not yet prioritised for the next iterations. But it is definitely in our backlog.
Somewhat related, we have already merged 3625 in order to minimise any possible effect of leader election API calls. Additionally, since 8.14.0 we have done a major refactoring in 37243, which we have shown helps the overall resource consumption.
I have run a script to evaluate the performance of our K8s integration. I evaluated all 8.x.0 versions between 8.5.0 and 8.15.0.
The test increases the number of pods in a one-node cluster in these steps: 12, 61, 111, 161, 211, 311, 411, and 511.
I recorded the following results after 5 min for each step.
Once the EA restarts, I stop recording results for the subsequent pod-count increases, since performance is no longer stable.
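For reference, a simplified sketch of what that loop looks like in practice (assuming `kubectl` plus metrics-server, a dummy `workload` Deployment to scale, and that the agent pod carries an `app=elastic-agent` label in `kube-system`; these names are assumptions for illustration):

```python
import subprocess
import time

POD_STEPS = [12, 61, 111, 161, 211, 311, 411, 511]

def sh(cmd):
    """Run a shell command and return its trimmed stdout."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout.strip()

for target in POD_STEPS:
    # Scale a dummy Deployment so the total pod count reaches the target
    # (the exact replica math depends on how many system pods already run).
    sh(f"kubectl scale deployment/workload --replicas={target}")
    time.sleep(5 * 60)  # let the agent settle for 5 minutes

    # CPU/memory of the agent pod as reported by metrics-server, plus its restart count.
    usage = sh("kubectl top pod -n kube-system -l app=elastic-agent --no-headers")
    restarts = sh(
        "kubectl get pod -n kube-system -l app=elastic-agent "
        "-o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'"
    )
    print(f"pods={target} usage={usage!r} restarts={restarts}")

    if restarts.isdigit() and int(restarts) > 0:
        print("Agent restarted; later steps are no longer stable, stopping here.")
        break
```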
**8.5.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 35m | 281Mi | 0 |
61 | 115m | 410Mi | 0 |
111 | 272m | 399Mi | 0 |
161 | 852m | 491Mi | 0 |
211 | 923m | 441Mi | 0 |
311 | 770m | 445Mi | 0 |
411 | 625m | 450Mi | 0 |
511 | 342m | 414Mi | 0 |
**8.6.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 33m | 407Mi | 0 |
61 | - | - | 4 |
No longer works from 61 pods up.
**8.7.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 32m | 431Mi | 0 |
61 | - | - | 4 |
No longer works from 61 pods up.
**8.8.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 24m | 378Mi | 0 |
61 | 94m | 489Mi | 0 |
111 | 298m | 596Mi | 0 |
161 | - | - | 1 |
No longer works from 161 pods up.
**8.9.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 32m | 421Mi | 0 |
61 | 92m | 533Mi | 0 |
111 | 250m | 639Mi | 0 |
161 | - | - | 1 |
No longer works from 161 pods up.
**8.10.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 25m | 424Mi | 0 |
61 | 90m | 543Mi | 0 |
111 | - | - | 2 |
No longer works from 111 pods up.
**8.11.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 14m | 435Mi | 0 |
61 | 54m | 577Mi | 0 |
111 | - | - | 2 |
No longer works from 111 pods up.
**8.12.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 15m | 445Mi | 0 |
61 | 54m | 604Mi | 0 |
111 | - | - | 2 |
No longer works from 111 pods up.
**8.13.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 14m | 441Mi | 0 |
61 | 51m | 538Mi | 0 |
111 | - | - | 2 |
No longer works from 111 pods up.
**8.14.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 13m | 580Mi | 0 |
61 | - | - | 1 |
No longer works from 61 pods up.
**8.15.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 28m | 595Mi | 0 |
61 | - | - | 1 |
No longer works from 61 pods up.
From 8.5 to 8.6, something changed that caused a huge memory increase in the Kubernetes integration, to the point that increasing the number of pods made the agent stop and restart over and over again.
From version 8.8, the number of pods that made the agent stop increased. This is a good sign, but notice that the default memory limits and requests also increased, which surely helps explain the seemingly better performance.
From 8.9 to 8.10, the number of pods that caused the EA to stop and restart decreased again, so something in the Kubernetes integration affected the agent's performance once more.
From 8.13 to 8.14, the number of pods that caused the EA to stop and restart decreased yet again, pointing to another change in the Kubernetes integration that affected the agent's performance. Also, from @gizas: 8.13 vs 8.14 shows a ~140Mi difference even with only 12 pods.
It seems the Kubernetes integration's memory usage has been growing since 8.5, with notable increases in 8.6, 8.10 and 8.14 (disregarding the increase of the default EA memory resources in 8.8, which helped hide possible issues in the Kubernetes integration).
@constanca-m Have you also tested whether the data is actually sent to Elastic? My setup had ~15 pods with 8.15 and the memory ran high; even though the pod itself didn't restart, the K8s data didn't come in (or was very spotty). I think one of the processes itself was crashing.
> Have you also tested whether the data is actually sent to Elastic?
In my case, I can see data in Discover (I am filtering by `kubernetes.container.name`).
I did not analyze the logs to know whether everything is being sent there or whether we are losing data. These are the logs from running all the tests in 8.15, including the pod restarts. @EvelienSchellekens
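For anyone who wants to double-check ingestion outside Discover, here is a small sketch using the official `elasticsearch` Python client; the endpoint, credentials and index pattern are placeholders, not the exact setup used in these tests:

```python
from elasticsearch import Elasticsearch

# Placeholders: point this at your own deployment and credentials.
es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "changeme"),
    verify_certs=False,
)

# Count recent documents carrying kubernetes.container.name; a non-zero,
# steadily growing count is a quick sanity check that data keeps flowing.
resp = es.count(
    index="metrics-*",
    query={
        "bool": {
            "filter": [
                {"exists": {"field": "kubernetes.container.name"}},
                {"range": {"@timestamp": {"gte": "now-5m"}}},
            ]
        }
    },
)
print("documents in the last 5 minutes:", resp["count"])
```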
Really useful @constanca-m !
Adding some notes here:
- Starting agent memory without the k8s integration, to see the Elastic Agent's baseline memory and also be able to calculate the k8s integration overhead.
A general comment is that all the above tests just measure memory consumption under the same k8s load. Identifying a memory leak requires watching the memory trend over time. An increase by itself is not necessarily bad (or good) if we are also observing more k8s resources.
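To watch that trend, something like the following can periodically sample the agent pod's memory via metrics-server (the `kube-system` namespace and `app=elastic-agent` label are assumptions based on the default manifests); running it once with and once without the Kubernetes integration enabled would also give an estimate of the integration's overhead:

```python
import csv
import subprocess
import time

SAMPLE_EVERY = 60        # seconds between samples
DURATION = 6 * 60 * 60   # watch the trend for a few hours

with open("agent_memory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["unix_time", "memory"])
    deadline = time.time() + DURATION
    while time.time() < deadline:
        out = subprocess.run(
            ["kubectl", "top", "pod", "-n", "kube-system",
             "-l", "app=elastic-agent", "--no-headers"],
            capture_output=True, text=True,
        ).stdout.strip()
        # `kubectl top pod` prints: NAME  CPU(cores)  MEMORY(bytes)
        memory = out.split()[-1] if out else ""
        writer.writerow([int(time.time()), memory])
        f.flush()
        time.sleep(SAMPLE_EVERY)
```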
Additionally:
Thank you @gizas.
I think this issue and the scripts to run these tests should be placed somewhere more accessible to the team. Maybe in the future repository you mentioned in Thursday's meeting to help with identifying issues.
> For example, in 8.12 I see 61 pods and memory 604Mi but 0 restarts. How is it possible to use more memory with a 500Mi limit?
It says the memory limit is 700Mi for 8.12.
> 8.13 vs 8.14 is a 140Mi difference even with only 12 pods?
It looks like it... This was just 1 test, and the values always vary a bit between runs. We could run a test with smaller pod increments to better capture the differences between these latest versions.
Edit: but since 8.15 has more or less the same values as 8.14, I believe we do have a significant difference between 8.13 and 8.14, like you pointed out. Thanks, I will include it in the notes of the original comment as well!
> Do the above tests include the System integration?
Yes. You are correct, we don't include tests running with just the System integration, unfortunately. I agree it would be good to also have an idea of that, but I don't believe the System integration is causing any issues here.
> In all the above tests we don't produce logs, right?
This is the hard part! With the agent starting and restarting over and over again, it is very hard: downloading the diagnostics gets stuck in a loop and the zip never gets ready. Not sure what is going on there, but I have not paid much attention to it.
Since https://github.com/elastic/elastic-agent/pull/3593 we disable the deployment and cronjob metadata enrichment. Maybe the cluster did not have any cronjobs, and the deployments were too few, to see any improvement. Since https://github.com/elastic/beats/pull/35483 we have introduced the ReplicaSet and Job metadata generation; as above, this could explain some of the increase.
Correct. Only the default pods, EA, metrics server and the NGINX pod.
I believe the best approach would be to look at the changelog and see what big changes we had. I remember the watchers issue, but since that PR includes the memory tests, I don't believe it could have influenced the degraded performance, though I could of course be wrong (and biased 😄).
@constanca-m this is the https://github.com/elastic/k8s-integration-infra?tab=readme-ov-file#put-load-on-the-cluster script mentioned in the call (public repo).
I used a different one @gizas; it is local and simpler (linked in the test results comment). I think it should be enough for these tests, and that script can be used for more complex tests.
I also performed some scale tests. I created a one-node cluster in GKE with ~95 pods running. I tested versions 8.13.0 and 8.14.0 with and without kube-state-metrics, to simulate the leader and non-leader node scenarios.
TBH the 700Mi memory limit suffices in both versions. Only with kube-state-metrics enabled did I get one restart, which means that in big clusters (note that Kubernetes has a limit of 110 pods per node) the memory limit needs some adjustment. Versions 8.13.0 and 8.14.0 do not seem to differ much: for 8.13.0 the agent pod's memory was around 600Mi, while for 8.14.0 it was around 640Mi. In all cases I used nginx pods, but a lot of logs were generated.
I don't know why @constanca-m got different results.
Background
Recent issues like 3863, 3991 and 4081 have shown that installing Elastic Agent with our Kubernetes integration in its default configuration can leave our customers in unfortunate circumstances (sometimes even with broken k8s clusters). There are many details and variables that affect the final setup and installation of our observability solution, and we can try to summarise and list them here.
Goals
This issue tries to summarise the next actions we need to take in order to investigate:
Actions
Current Actions
We have observed so far that:
a) Memory consumption of Elastic Agent has increased from 8.8 to 8.9 and later versions (relevant: https://github.com/elastic/sdh-beats/issues/3863#issuecomment-1733750863)
b) The number of API calls towards the Kubernetes control plane API has increased since version 8.9 (see Salesforce 01507229 regarding Elastic Agent overloading the Kubernetes API server: https://github.com/elastic/sdh-beats/issues/3991#issuecomment-1787648161)
c) CPU consumption (although not such a big issue at the moment and not a first priority) has been raised as a concern here.
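As a rough way to quantify (b), the API server's own metrics can be sampled before and after deploying the agent. A sketch (note this counts cluster-wide requests per resource and verb, not per client, and assumes permission to call `kubectl get --raw /metrics`):

```python
import re
import subprocess
from collections import Counter

# Scrape the API server's Prometheus metrics endpoint.
raw = subprocess.run(
    ["kubectl", "get", "--raw", "/metrics"],
    capture_output=True, text=True,
).stdout

# Aggregate apiserver_request_total by (resource, verb).
pattern = re.compile(r'^apiserver_request_total\{(.*)\} ([0-9.e+]+)$')
totals = Counter()
for line in raw.splitlines():
    m = pattern.match(line)
    if not m:
        continue
    labels = dict(kv.split("=", 1) for kv in m.group(1).split(","))
    key = (labels.get("resource", "").strip('"'), labels.get("verb", "").strip('"'))
    totals[key] += float(m.group(2))

for (resource, verb), count in totals.most_common(15):
    print(f"{resource:25s} {verb:10s} {count:12.0f}")
```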
Until now:
Next Planned Actions
Future Plans/Actions