Closed EricDavisX closed 2 years ago
I can copy in some additional research as posted in Slack, if desired. But I could let Tyler, Mikhail, and Response Ops team members cite their own latest findings. @kobelb @mshustov @danielmitterdorfer FYI
From Jonathan B: I can't speak for all of CI metrics, but for package testing we first noticed this on March 3rd. This doesn't collect metrics, but there's some useful info in the build history.
Jonathan Budzenski:

```
Mar 03 19:57:57 packer-virtualbox-iso-1561908076 kibana[2313]: [2022-03-03T19:57:57.109+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-observability.uptime.alerts
Mar 03 19:57:57 packer-virtualbox-iso-1561908076 kibana[2313]: [2022-03-03T19:57:57.351+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .preview.alerts-security.alerts
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: <--- Last few GCs --->
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: [2313:0x5170bb0] 28312 ms: Mark-sweep 477.4 (509.6) -> 475.3 (508.1) MB, 673.1 / 0.1 ms (average mu = 0.121, current mu = 0.008) allocation failure GC in old space requested
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: [2313:0x5170bb0] 29291 ms: Mark-sweep 481.1 (509.8) -> 475.9 (508.6) MB, 968.9 / 0.1 ms (average mu = 0.062, current mu = 0.011) allocation failure scavenge might not succeed
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: <--- JS stacktrace --->
```
Jonathan Budzenski: https://buildkite.com/elastic/kibana-package-testing/builds/180#292b4f9b-a267-4d30-a768-211992b9407c
Jonathan Budzenski: Coincides with https://github.com/elastic/kibana/pull/124816 ("[Security Solution] Initial version of the Cloud Security Posture app", hidden behind an advanced setting).
Jonathan Budzenski: I'll see what happens if I revert.
From Mikhail Shustov, Tyler, and Patrick:

@tylersmalley Do you know why APM metrics collected on CI for the 8.1.0 branch have a gap starting from March 8? I wonder why the 8.1 snapshot (latest) in our test didn't seem to follow that pattern. Maybe the workload during your testing is different from the workload performed at CI? Can you see the difference in memory size before running your test suite?
Patrick Mueller: Wrt the Rx/status stuff I noticed in the heap snapshots, I also noticed this PR got merged two weeks ago: https://github.com/elastic/kibana/pull/126320, and poking around a bit, also this PR: https://github.com/elastic/kibana/pull/126334. And I've recently noticed the status messages getting a little twitchy; I got this when using curl to create a new connector and then a new rule using that connector:

```
[... 14:22:20.123][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': alerting
[... 14:22:20.374][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': alerting
[... 14:22:21.965][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': licensing
[... 14:22:22.395][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': licensing
[... 14:22:35.487][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': fleet
[... 14:22:35.737][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': fleet
```
Patrick Mueller:

> Maybe a workload during your testing is different from the workload performed at CI?

Ya, it's certainly different. So one possibility is that our workload is now (indirectly) pulling in more code/processing than it did previously. Had the previous versions also touched the code that it's (presumably) now touching, we wouldn't see the difference. But the bulk of the difference I saw in the heap snapshots was mostly Observable related.
Tyler Smalley: @mikhail.shustov, that's when 8.1.0 was released, so the 8.1 branch was bumped to 8.1.1. To make it easier, you could compare `labels.branch`.
You can find a couple of spikes on the RSS chart of APM data collected on Kibana CI (around the end of January and on March 14). You can find the APM data here.
@jbudz did you find anything when you did the revert you were trying?
No, that wasn't the root cause. Still had a similar memory profile.
Reverting https://github.com/elastic/kibana/pull/126320 appears to fix this.
Now: https://github.com/elastic/kibana/pull/128095 https://kibana-pr-128095.kb.us-west2.gcp.elastic-cloud.com:9243/app/monitoring#/kibana/instances
After reverting: https://github.com/elastic/kibana/pull/128096 https://kibana-pr-128096.kb.us-west2.gcp.elastic-cloud.com:9243/app/monitoring#/kibana/instances
Pinging @elastic/kibana-core (Team:Core)
I added the blocker label for 8.2.0. Kibana's pretty close to OOM on a fresh cloud installation.
I checked the memory snapshot with https://github.com/elastic/kibana/pull/126320 reverted. There are still a lot of Observable objects.
Status service seems to be the root cause of the problem. I made a heap snapshot on https://github.com/elastic/kibana/pull/128096 with the dev version of Kibana.
As you can see, there are quite a few RxJS objects created.
v8.2 vs. https://github.com/elastic/kibana/pull/12809
There are too many MapSubscriber instances to analyze their origin, but lots of the SwitchMapSubscriber instances point to PluginsStatusService.
I noticed that this logic (with an observable creation) is called ~22,000 times on start: https://github.com/elastic/kibana/blob/main/src/core/server/status/plugins_status.ts#L132-L142. I'm not familiar enough with the service. Could someone from @elastic/kibana-core take a look at the problem, please? If you are interested in analyzing the heap dump, you can use built-in Node.js functionality: https://nodejs.org/api/cli.html#--heapsnapshot-signalsignal
> Reverting https://github.com/elastic/kibana/pull/126320 appears to fix this.
Maybe we can merge https://github.com/elastic/kibana/pull/128096 to check whether RSS is back to normal in the APM data gathered from CI: https://kibana-ci-apm.kb.us-central1.gcp.cloud.es.io:9243/goto/ac99b0f0-a93f-11ec-94a3-2d0a38b40710
I pulled #128096 out of draft and added core as a reviewer - whatever's easiest to help investigate. Feel free to merge.
> I pulled https://github.com/elastic/kibana/pull/128096 out of draft and added core as a reviewer
The PR has been merged. I can see the drop in APM metrics, but it doesn't seem we are back to normal yet.

3-month trend

2-day trend
[Edited] I misposted prior, sorry for any confusion. I'm not sure how to test this pre-merge, but I will be watching for it afterwards.
@gsoldevila is working on improving the plugin status service's memory footprint. I will test his PR and repeat the heap snapshot analysis.
@kobelb Having #128324 merged, we can remove the blocker status. With the PR merged, the size of the consumed RSS memory was reduced by almost half, and event loop delay improved by nearly 2.5 times.

One-week dashboard
According to APM metrics collected at CI, memory usage metrics have rolled back to their all-time lows since January 12, when APM integration on CI was introduced.

3-month dashboard
I'm among others, I'm sure, but I'm waiting for a successful platform build to get a new 8.2 snapshot; then a nightly run of the kbn-alert-load tool will pick up the change and show us the improvement (again). Very nice work!
We believe the root cause has been addressed in https://github.com/elastic/kibana/pull/128324, so we will go ahead and close this issue.
@EricDavisX Feel free to reopen or open a new issue if you find there is something we have missed!
Kibana version: 8.2.0-SNAPSHOT (details below show the problem seems to have started around March 1, 2022)
Elasticsearch version: same as Kibana
Original install method (e.g. download page, yum, from source, etc.):
Describe the bug: The test starts up several cloud clusters and individually assesses their memory usage and other metrics after creating a variable number of rules, a variable percentage of which can create alerts from their execution.
Screenshots (if relevant):
Errors in browser console (if relevant): n/a
Provide logs and/or server output (if relevant): Here are the things from Slack that we've already gathered:
Questions asked (and answered, mostly): Is this repeatable, or was this a single run that could be off (I guess not)? Yes, repeatable with the kbn-alert-load tool and locally.
Is that also reproducible locally? Not applicable to the tool's use, but yes, others see memory concerns locally.

From Brandon K: Anecdotally, I'm seeing the memory usage increase from 8.1 to 8.2 even when there are no alerting rules running. The following are heap usages when starting up a new Kibana instance and logging in:

8.1: 403 MB
8.2: 574 MB
For an RSS increase, the first thing that would be interesting to understand is whether it's an increase in native memory or heap memory. Heap dumps from Brandon K are here: https://drive.google.com/drive/folders/1n0jjJ_H3oEbViMYjd8eKKqqcwM2mbcOX?usp=sharing
NOTE: we suspected possibly the Node version, but it does not seem to be that: 8.0 uses Node 16.13.2 (see https://github.com/elastic/kibana/blob/8.0/.node-version), as does main (https://github.com/elastic/kibana/blob/main/.node-version).
Patrick M offered up a script he had used to create many running rules locally, if helpful: https://gist.github.com/pmuellr/f30331660ae032a0a7ccf2767aea3900