gizas opened 10 months ago
Run tests in real k8s clusters and retrieve diagnostics from the Agent to investigate memory consumption
Once we've resolved the issues (or earlier, if resolving them is not straightforward and we need to iterate): I think we should also figure out how to reliably reproduce the issues in an ephemeral cluster, ideally with some automation in place to create the cluster and whatever workload is necessary to trigger the issues (e.g. create a bunch of deployments/pods/whatever).
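For example, a minimal sketch of such automation, assuming `kind` and `kubectl` are installed and that a plain nginx Deployment is a good-enough stand-in workload (the cluster name, Deployment name and replica count below are only illustrative):

```python
import subprocess

def run(cmd):
    """Run a command, echo it, and fail loudly so broken steps are visible."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a throwaway single-node cluster (assumes kind is installed).
run(["kind", "create", "cluster", "--name", "agent-repro"])

# Deploy a trivial workload that can be scaled up later to trigger the issues.
run(["kubectl", "create", "deployment", "workload", "--image=nginx", "--replicas=50"])
run(["kubectl", "rollout", "status", "deployment/workload", "--timeout=300s"])

# Next step (not shown): install Elastic Agent (standalone manifest or Helm),
# then scale the workload further while watching the agent's resource usage.
```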
Then we can:
Thanks @axw, I have updated the Next actions section a bit and added some previous ideas/issues that we can investigate here.
As a short-term measure, can we somehow document the known issues / limitations we have been facing so far?
Is there progress in the latest version, or is it still destroying the k8s master? I disabled Elastic in our cluster a while ago and am checking whether there has been any progress so far. I can't really tell whether it should have improved if I upgrade.
We have tracked down the source of the high memory usage on k8s and are working to fix it. https://github.com/elastic/elastic-agent/issues/4729 is the tracking issue.
And what about rate-limiting the k8s apiserver requests? Is there any work going on for that?
> what about rate-limiting the k8s apiserver requests
Regarding rate limiting, the main issue is this one, which is not yet prioritised for the next iterations. But it is definitely in our backlog.
Somewhat related, we have already merged 3625 in order to minimise any possible effect of leader election API calls. Additionally, since 8.14.0 we have done a major refactoring in 37243, which we have shown helps the overall resource consumption.
I have run a script to evaluate the performance of our K8s integration. I evaluated all 8.x.0 versions between 8.5.0 and 8.15.0.
The test increases the number of pods in a one-node cluster in these steps: 12, 61, 111, 161, 211, 311, 411, and 511.
I recorded the following results after 5 min for each step.
Once the EA restarts, I stop recording results for the subsequent pod-count increases, since performance is no longer stable.
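For reference, a simplified sketch of what that loop looks like in practice (assuming `kubectl` plus metrics-server, a dummy `workload` Deployment to scale, and that the agent pod carries an `app=elastic-agent` label in `kube-system`; these names are assumptions for illustration):

```python
import subprocess
import time

POD_STEPS = [12, 61, 111, 161, 211, 311, 411, 511]

def sh(cmd):
    """Run a shell command and return its trimmed stdout."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout.strip()

for target in POD_STEPS:
    # Scale a dummy Deployment so the total pod count reaches the target
    # (the exact replica math depends on how many system pods already run).
    sh(f"kubectl scale deployment/workload --replicas={target}")
    time.sleep(5 * 60)  # let the agent settle for 5 minutes

    # CPU/memory of the agent pod as reported by metrics-server, plus its restart count.
    usage = sh("kubectl top pod -n kube-system -l app=elastic-agent --no-headers")
    restarts = sh(
        "kubectl get pod -n kube-system -l app=elastic-agent "
        "-o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'"
    )
    print(f"pods={target} usage={usage!r} restarts={restarts}")

    if restarts.isdigit() and int(restarts) > 0:
        print("Agent restarted; later steps are no longer stable, stopping here.")
        break
```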
**8.5.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 35m | 281Mi | 0 |
61 | 115m | 410Mi | 0 |
111 | 272m | 399Mi | 0 |
161 | 852m | 491Mi | 0 |
211 | 923m | 441Mi | 0 |
311 | 770m | 445Mi | 0 |
411 | 625m | 450Mi | 0 |
511 | 342m | 414Mi | 0 |
**8.6.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 33m | 407Mi | 0 |
61 | - | - | 4 |
No longer works from 61 pods up.
**8.7.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 32m | 431Mi | 0 |
61 | - | - | 4 |
No longer works from 61 pods up.
**8.8.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 24m | 378Mi | 0 |
61 | 94m | 489Mi | 0 |
111 | 298m | 596Mi | 0 |
161 | - | - | 1 |
No longer works from 161 pods up.
**8.9.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 32m | 421Mi | 0 |
61 | 92m | 533Mi | 0 |
111 | 250m | 639Mi | 0 |
161 | - | - | 1 |
No longer works from 161 pods up.
**8.10.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 25m | 424Mi | 0 |
61 | 90m | 543Mi | 0 |
111 | - | - | 2 |
No longer works from 111 pods up.
**8.11.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 14m | 435Mi | 0 |
61 | 54m | 577Mi | 0 |
111 | - | - | 2 |
No longer works from 111 pods up.
**8.12.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 15m | 445Mi | 0 |
61 | 54m | 604Mi | 0 |
111 | - | - | 2 |
No longer works from 111 pods up.
**8.13.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 14m | 441Mi | 0 |
61 | 51m | 538Mi | 0 |
111 | - | - | 2 |
No longer works from 111 pods up.
**8.14.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 13m | 580Mi | 0 |
61 | - | - | 1 |
No longer works from 61 pods up.
**8.15.0**, using the default configuration from the agent:
```yaml
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
```
Results:
Pods | CPU | Memory | EA pod restarts |
---|---|---|---|
12 | 28m | 595Mi | 0 |
61 | - | - | 1 |
No longer works from 61 pods up.
From 8.5 to 8.6, something changed that caused a huge memory increase in the Kubernetes integration, to the point that increasing the number of pods made the agent stop and restart over and over again.
From version 8.8, the number of pods that made the agent stop increased. This is a good sign, but notice that the default memory limits and requests also increased, which surely helps explain the seemingly better performance.
From 8.9 to 8.10, the number of pods that caused the EA to stop and restart decreased again, so something in the Kubernetes integration affected the agent's performance once more.
From 8.13 to 8.14, the number of pods that caused the EA to stop and restart decreased yet again, pointing to another change in the Kubernetes integration that affected the agent's performance. Also, from @gizas: 8.13 vs 8.14 shows a ~140Mi difference even with only 12 pods.
It seems the Kubernetes integration's memory usage has been growing since 8.5, with notable increases in 8.6, 8.10 and 8.14 (disregarding the increase of the default EA memory resources in 8.8, which helped hide possible issues in the Kubernetes integration).
@constanca-m Have you also tested whether the data is actually sent to Elastic? My setup had ~15 pods with 8.15 and the memory ran high; even though the pod itself didn't restart, the K8s data didn't come in (or was very spotty). I think one of the processes itself was crashing.
> Have you also tested whether the data is actually sent to Elastic?
In my case, I can see data in Discover (I am filtering by `kubernetes.container.name`).
I did not analyze the logs to know whether everything is being sent there or whether we are losing data. These are the logs from running all the tests in 8.15, including the pod restarts. @EvelienSchellekens
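For anyone who wants to double-check ingestion outside Discover, here is a small sketch using the official `elasticsearch` Python client; the endpoint, credentials and index pattern are placeholders, not the exact setup used in these tests:

```python
from elasticsearch import Elasticsearch

# Placeholders: point this at your own deployment and credentials.
es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "changeme"),
    verify_certs=False,
)

# Count recent documents carrying kubernetes.container.name; a non-zero,
# steadily growing count is a quick sanity check that data keeps flowing.
resp = es.count(
    index="metrics-*",
    query={
        "bool": {
            "filter": [
                {"exists": {"field": "kubernetes.container.name"}},
                {"range": {"@timestamp": {"gte": "now-5m"}}},
            ]
        }
    },
)
print("documents in the last 5 minutes:", resp["count"])
```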
Really useful @constanca-m !
Adding some notes here:
- Starting agent memory without the k8s integration, to see the Elastic Agent's baseline memory and also be able to calculate the k8s integration overhead.
A general comment is that all the above tests just measure memory consumption under the same k8s load. Identifying a memory leak requires watching the memory trend over time. An increase by itself is not necessarily bad (or good) if we are also observing more k8s resources.
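To watch that trend, something like the following can periodically sample the agent pod's memory via metrics-server (the `kube-system` namespace and `app=elastic-agent` label are assumptions based on the default manifests); running it once with and once without the Kubernetes integration enabled would also give an estimate of the integration's overhead:

```python
import csv
import subprocess
import time

SAMPLE_EVERY = 60        # seconds between samples
DURATION = 6 * 60 * 60   # watch the trend for a few hours

with open("agent_memory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["unix_time", "memory"])
    deadline = time.time() + DURATION
    while time.time() < deadline:
        out = subprocess.run(
            ["kubectl", "top", "pod", "-n", "kube-system",
             "-l", "app=elastic-agent", "--no-headers"],
            capture_output=True, text=True,
        ).stdout.strip()
        # `kubectl top pod` prints: NAME  CPU(cores)  MEMORY(bytes)
        memory = out.split()[-1] if out else ""
        writer.writerow([int(time.time()), memory])
        f.flush()
        time.sleep(SAMPLE_EVERY)
```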
Additionally:
Thank you @gizas.
I think this issue and the scripts to run these tests should be placed somewhere more accessible to the team. Maybe in the future repository you mentioned in Thursday's meeting to help with identifying issues.
> For example, in 8.12 I see 61 pods and memory 604Mi but 0 restarts. How is it possible to use more memory with a 500Mi limit?
It says the memory limit is 700Mi for 8.12.
> 8.13 vs 8.14 is a 140Mi difference even with only 12 pods?
It looks like it... This was just 1 test, and the values always vary a bit between runs. We could run a test with smaller pod increments to better capture the differences between these latest versions.
Edit: but since 8.15 has more or less the same values as 8.14, I believe we do have a significant difference between 8.13 and 8.14, like you pointed out. Thanks, I will include it in the notes of the original comment as well!
> Do the above tests include the System integration?
Yes. You are correct, we don't include tests running with just the System integration, unfortunately. I agree it would be good to also have an idea of that, but I don't believe the System integration is causing any issues here.
> In all the above tests we don't produce logs, right?
This is the hard part! With the agent starting and restarting over and over again, it is very hard: downloading the diagnostics gets stuck in a loop and the zip never gets ready. Not sure what is going on there, but I have not paid much attention to it.
Since https://github.com/elastic/elastic-agent/pull/3593 we disable the deployment and cronjob metadata enrichment. Maybe the cluster did not have any cronjobs, and the deployments were too few, to see any improvement. Since https://github.com/elastic/beats/pull/35483 we have introduced the ReplicaSet and Job metadata generation; as above, this could explain some of the increase.
Correct. Only the default pods, EA, metrics server and the NGINX pod.
I believe the best approach would be to look at the changelog and see what big changes we had. I remember the watchers issue, but since that PR includes the memory tests, I don't believe it could have influenced the degraded performance, though I could of course be wrong (and biased 😄).
@constanca-m this is the https://github.com/elastic/k8s-integration-infra?tab=readme-ov-file#put-load-on-the-cluster script mentioned in the call (public repo).
I used a different one @gizas; it is local and simpler (linked in the test results comment). I think it should be enough for these tests, and that script can be used for more complex tests.
I also performed some scale tests. I created a one-node cluster in GKE with ~95 pods running. I tested versions 8.13.0 and 8.14.0 with and without kube-state-metrics, to simulate the leader and non-leader node scenarios.
TBH the 700Mi memory limit suffices in both versions. Only with kube-state-metrics enabled did I get one restart, which means that in big clusters (note that Kubernetes has a limit of 110 pods per node) the memory limit needs some adjustment. Versions 8.13.0 and 8.14.0 do not seem to differ much: for 8.13.0 the agent pod's memory was around 600Mi, while for 8.14.0 it was around 640Mi. In all cases I used nginx pods, but a lot of logs were generated.
I don't know why @constanca-m got different results.
Background
Recent issues like 3863, 3991 and 4081 have shown that installing Elastic Agent with our Kubernetes integration in its default configuration can leave our customers in unfortunate circumstances (sometimes even with broken k8s clusters). There are many details and variables that affect the final setup and installation of our observability solution, and we can try to summarise and list them here.
Goals
This issue tries to summarise the next actions we need to take in order to investigate:
Actions
Current Actions
We have observed so far that:
a) Memory consumption of Elastic Agent has increased from 8.8 to 8.9 and later versions (relevant: https://github.com/elastic/sdh-beats/issues/3863#issuecomment-1733750863)
b) The number of API calls towards the Kubernetes control plane API has increased since version 8.9 (see Salesforce 01507229 regarding Elastic Agent overloading the Kubernetes API server: https://github.com/elastic/sdh-beats/issues/3991#issuecomment-1787648161)
c) CPU consumption (although not such a big issue at the moment and not a first priority) has been raised as a concern here.
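As a rough way to quantify (b), the API server's own metrics can be sampled before and after deploying the agent. A sketch (note this counts cluster-wide requests per resource and verb, not per client, and assumes permission to call `kubectl get --raw /metrics`):

```python
import re
import subprocess
from collections import Counter

# Scrape the API server's Prometheus metrics endpoint.
raw = subprocess.run(
    ["kubectl", "get", "--raw", "/metrics"],
    capture_output=True, text=True,
).stdout

# Aggregate apiserver_request_total by (resource, verb).
pattern = re.compile(r'^apiserver_request_total\{(.*)\} ([0-9.e+]+)$')
totals = Counter()
for line in raw.splitlines():
    m = pattern.match(line)
    if not m:
        continue
    labels = dict(kv.split("=", 1) for kv in m.group(1).split(","))
    key = (labels.get("resource", "").strip('"'), labels.get("verb", "").strip('"'))
    totals[key] += float(m.group(2))

for (resource, verb), count in totals.most_common(15):
    print(f"{resource:25s} {verb:10s} {count:12.0f}")
```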
Until now:
Next Planned Actions
Future Plans/Actions