Closed EricDavisX closed 2 years ago
I can copy in some additional research as posted in Slack, if desired. But I could let Tyler, Mikhail, and Response Ops team members cite their own latest findings. @kobelb @mshustov @danielmitterdorfer FYI
From Jonathan B: I can't speak for all of CI metrics, but for package testing we first noticed this on March 3rd. This doesn't collect metrics, but there's some useful info in the build history.
Jonathan Budzenski:

```
Mar 03 19:57:57 packer-virtualbox-iso-1561908076 kibana[2313]: [2022-03-03T19:57:57.109+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-observability.uptime.alerts
Mar 03 19:57:57 packer-virtualbox-iso-1561908076 kibana[2313]: [2022-03-03T19:57:57.351+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .preview.alerts-security.alerts
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: <--- Last few GCs --->
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: [2313:0x5170bb0] 28312 ms: Mark-sweep 477.4 (509.6) -> 475.3 (508.1) MB, 673.1 / 0.1 ms (average mu = 0.121, current mu = 0.008) allocation failure GC in old space requested
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: [2313:0x5170bb0] 29291 ms: Mark-sweep 481.1 (509.8) -> 475.9 (508.6) MB, 968.9 / 0.1 ms (average mu = 0.062, current mu = 0.011) allocation failure scavenge might not succeed
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: <--- JS stacktrace --->
```
Jonathan Budzenski: https://buildkite.com/elastic/kibana-package-testing/builds/180#292b4f9b-a267-4d30-a768-211992b9407c
Jonathan Budzenski: Coincides with https://github.com/elastic/kibana/pull/124816 ("[Security Solution] Initial version of the Cloud Security Posture app", hidden behind an advanced setting).
Jonathan Budzenski: I'll see what happens if I revert.
From Mikhail Shustov, Tyler, and Patrick:

@tylersmalley Do you know why APM metrics collected on CI for the 8.1.0 branch have a gap starting from March 8? I wonder why the 8.1 snapshot (latest) in our test didn't seem to follow that pattern. Maybe the workload during your testing is different from the workload performed at CI? Can you see the difference in memory size before running your test suite?
Patrick Mueller: Wrt the Rx/status stuff I noticed in the heap snapshots, I also noticed this PR got merged two weeks ago: https://github.com/elastic/kibana/pull/126320, and poking around a bit, also this PR: https://github.com/elastic/kibana/pull/126334. And I've recently noticed the status messages getting a little twitchy; I got this when using curl to create a new connector and then a new rule using that connector:

```
[... 14:22:20.123][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': alerting
[... 14:22:20.374][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': alerting
[... 14:22:21.965][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': licensing
[... 14:22:22.395][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': licensing
[... 14:22:35.487][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': fleet
[... 14:22:35.737][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': fleet
```
Patrick Mueller:

> Maybe a workload during your testing is different from the workload performed at CI?

Ya, it's certainly different. So one possibility is that our workload is now (indirectly) pulling in more code/processing than it did previously. Had the previous versions also touched the code that it's (presumably) now touching, we wouldn't see the difference. But the bulk of the difference I saw in the heap snapshots was mostly Observable related.
Tyler Smalley: @mikhail.shustov, that's when 8.1.0 was released, so the 8.1 branch was bumped to 8.1.1. To make it easier, you could compare `labels.branch`.
You can find a couple of spikes on the RSS chart of APM data collected on Kibana CI (around the end of January and on March 14). You can find the APM data here.
@jbudz did you find anything when you did the revert you were trying?
No, that wasn't the root cause. Still had a similar memory profile.
Reverting https://github.com/elastic/kibana/pull/126320 appears to fix this.
Now: https://github.com/elastic/kibana/pull/128095 https://kibana-pr-128095.kb.us-west2.gcp.elastic-cloud.com:9243/app/monitoring#/kibana/instances
After reverting: https://github.com/elastic/kibana/pull/128096 https://kibana-pr-128096.kb.us-west2.gcp.elastic-cloud.com:9243/app/monitoring#/kibana/instances
Pinging @elastic/kibana-core (Team:Core)
I added the blocker label for 8.2.0. Kibana's pretty close to OOM on a fresh cloud installation.
I checked the memory snapshot with https://github.com/elastic/kibana/pull/126320 reverted. There are still a lot of Observable objects.
Status service seems to be the root cause of the problem. I made a heap snapshot on https://github.com/elastic/kibana/pull/128096 with the dev version of Kibana.
As you can see, there are quite a few RxJS objects created.
v8.2 vs. https://github.com/elastic/kibana/pull/12809
There are too many MapSubscriber instances to analyze their origin, but lots of the SwitchMapSubscriber instances point to PluginsStatusService.
I noticed that this logic (with an observable creation) is called ~22,000 times on start: https://github.com/elastic/kibana/blob/main/src/core/server/status/plugins_status.ts#L132-L142. I'm not familiar enough with the service. Could someone from @elastic/kibana-core take a look at the problem, please? If you are interested in analyzing the heap dump, you can use built-in Node.js functionality: https://nodejs.org/api/cli.html#--heapsnapshot-signalsignal
> Reverting https://github.com/elastic/kibana/pull/126320 appears to fix this.
Maybe we can merge https://github.com/elastic/kibana/pull/128096 to check whether RSS is back to normal in the APM data gathered from CI: https://kibana-ci-apm.kb.us-central1.gcp.cloud.es.io:9243/goto/ac99b0f0-a93f-11ec-94a3-2d0a38b40710
I pulled #128096 out of draft and added core as a reviewer - whatever's easiest to help investigate. Feel free to merge.
> I pulled https://github.com/elastic/kibana/pull/128096 out of draft and added core as a reviewer
The PR has been merged. I can see the drop in APM metrics, but it doesn't seem we are back to normal yet.

3-month trend

2-day trend
[Edited] I misposted prior, sorry for any confusion. I'm not sure how to test this pre-merge, but I will be watching for it afterwards.
@gsoldevila is working on improving the plugin status service's memory footprint. I will test his PR and repeat the heap snapshot analysis.
@kobelb Having #128324 merged, we can remove the blocker status. With the PR merged, the size of the consumed RSS memory was reduced by almost half, and event loop delay improved by nearly 2.5 times.

One-week dashboard
According to APM metrics collected at CI, memory usage metrics have rolled back to their all-time lows since January 12, when APM integration on CI was introduced.

3-month dashboard
I'm among others, I'm sure, but I'm waiting for a successful platform build to get a new 8.2 snapshot; then a nightly run of the kbn-alert-load tool will pick up the change and show us the improvement (again). Very nice work!
We believe the root cause has been addressed in https://github.com/elastic/kibana/pull/128324, so we will go ahead and close this issue.
@EricDavisX Feel free to reopen or open a new issue if you find there is something we have missed!
Kibana version: 8.2.0-SNAPSHOT (details below show the problem seems to have started around March 1, 2022)
Elasticsearch version: same as Kibana
Original install method (e.g. download page, yum, from source, etc.):
Describe the bug: The test starts up several cloud clusters and individually assesses their memory usage and other metrics after creating a variable number of rules, a variable percentage of which can create alerts from their execution.
Screenshots (if relevant):
Errors in browser console (if relevant): n/a
Provide logs and/or server output (if relevant): Here are the things from Slack that we've already gathered:
Questions asked (and answered, mostly): Is this repeatable, or was this a single run that could be off (I guess not)? Yes, repeatable with the kbn-alert-load tool and locally.
Is that also reproducible locally? Not applicable to the tool's use, but yes, others see memory concerns locally.

From Brandon K: Anecdotally, I'm seeing the memory usage increase from 8.1 to 8.2 even when there are no alerting rules running. The following are heap usages when starting up a new Kibana instance and logging in:

8.1: 403 MB
8.2: 574 MB
For an RSS increase, the first thing that would be interesting to understand is whether it's an increase in native memory or heap memory. Heap dumps from Brandon K are here: https://drive.google.com/drive/folders/1n0jjJ_H3oEbViMYjd8eKKqqcwM2mbcOX?usp=sharing
NOTE: we suspected possibly the Node version, but it does not seem to be that: 8.0 uses Node 16.13.2 (see https://github.com/elastic/kibana/blob/8.0/.node-version), as does main (https://github.com/elastic/kibana/blob/main/.node-version).
Patrick M offered up a script he had used to create many running rules locally, if helpful: https://gist.github.com/pmuellr/f30331660ae032a0a7ccf2767aea3900