elastic / kibana


Kibana RSS memory spike seen in performance test #131950

Closed · EricDavisX closed this issue 2 years ago

EricDavisX commented 2 years ago

Cluster version & info: 8.3.0-SNAPSHOT is where the memory usage spike is seen. Note: the spike was present before, but was not as severe until about 25 days ago (mid-April). The test is run against the CFT (us-west2) region of GCP.

Original install method (e.g. download page, yum, from source, etc.): The kbn-alert-load test runs multiple versions in cloud and lets the team visually compare them. See the screenshot below.

Describe the bug: Memory spikes early in the test (about 2 minutes in), after all rules have been created and the processing of all 1000 of them is in full force.

Steps to reproduce:

  1. On a fresh install, create 1000 rules via scripting, with 50 percent of them firing actions, then wait 2 minutes while monitoring memory.

The kbn-alert-load tool can be used to help test / validate this, though not easily against branch PR code. A rough sketch of what the scripted rule creation amounts to is included below for reference.
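
Not part of the original report, but roughly what step 1 amounts to, as a minimal sketch assuming Node 18+ (for global fetch), the standard Kibana alerting HTTP API (POST /api/alerting/rule), an .index-threshold rule type, and a pre-existing server-log connector. The names, credentials, and data index below are placeholders, and kbn-alert-load itself does this differently:

```js
// Minimal sketch (not kbn-alert-load itself): create 1000 rules via the
// Kibana alerting HTTP API, wiring ~50% of them to an action and tuning the
// threshold so those rules actually fire. All ids/credentials are placeholders.
const KIBANA_URL = 'http://localhost:5601';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64');

async function createRule(i, firing) {
  const res = await fetch(`${KIBANA_URL}/api/alerting/rule`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'kbn-xsrf': 'true', Authorization: AUTH },
    body: JSON.stringify({
      name: `perf-rule-${i}`,
      rule_type_id: '.index-threshold',
      consumer: 'alerts',
      schedule: { interval: '1m' },
      notify_when: 'onActiveAlert',
      params: {
        index: ['perf-test-data'],      // hypothetical data index
        timeField: '@timestamp',
        aggType: 'count',
        groupBy: 'all',
        timeWindowSize: 5,
        timeWindowUnit: 'm',
        thresholdComparator: '>',
        threshold: [firing ? 0 : 1e9],  // firing rules always exceed the threshold
      },
      actions: firing
        ? [{ group: 'threshold met', id: 'server-log-connector-id', params: { message: 'fired' } }]
        : [],
    }),
  });
  if (!res.ok) throw new Error(`rule ${i} failed: ${res.status} ${await res.text()}`);
}

(async () => {
  for (let i = 0; i < 1000; i++) await createRule(i, i % 2 === 0);
  console.log('created 1000 rules; monitor Kibana RSS/heap for ~2 minutes');
})();
```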

Expected behavior: The memory spikes, but only up to around 800 million bytes, which is the level we saw in prior runs. It is worth noting that the default cloud memory cap for Kibana is right at 800 million bytes.

Screenshots (if relevant): kib-rss-mem-spike

Errors in browser console (if relevant): n/a

Provide logs and/or server output (if relevant): the test runs in CFT, so we don't have Kibana logs yet. We can tweak the execution to run longer and then inspect (and capture) logs live; help is appreciated - ping the Response Ops team as needed.

Any additional context: In Slack, Jonathan Budzenski asked for this issue to be filed and for all 3 teams to be pinged.

elasticmachine commented 2 years ago

Pinging @elastic/kibana-core (Team:Core)

elasticmachine commented 2 years ago

Pinging @elastic/kibana-operations (Team:Operations)

elasticmachine commented 2 years ago

Pinging @elastic/response-ops (Team:ResponseOps)

rudolf commented 2 years ago

Do we have historical records so we can pinpoint which build between 55 and 83 introduced the regression? That would make it significantly easier to identify potential changes that could have caused it.

I would suggest narrowing it down as much as possible and then taking a heap snapshot on both versions, e.g. 1 minute into the test. Chrome's debugging tools allow you to compare two heap snapshots, which would help us identify what's filling up the memory with the extra ~200MB and, hopefully, why that started happening.

rudolf commented 2 years ago

The test is run against CFT (us-west2) region of GCP.

How much heap was Kibana given in these tests (--max-old-space-size)?
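
(Not part of the original thread: if it helps answer this, one quick way to confirm the effective limit is to read V8's heap statistics from a process started with the same flags as the Kibana deployment; a minimal sketch follows.)

```js
// Sketch: print the V8 heap limit the process was started with.
// Run with the same flags as the Kibana deployment, e.g.:
//   node --max-old-space-size=800 heap-limit.js
const v8 = require('v8');

const { heap_size_limit } = v8.getHeapStatistics();
console.log(`heap_size_limit: ${(heap_size_limit / 1024 / 1024).toFixed(0)} MB`);
```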

EricDavisX commented 2 years ago

I have dug in more thoroughly and can confirm the problem actually started earlier than I first reported, sorry:

So, you can interpret that as you see fit. I'd start looking around April 11 for Kibana check-ins. I don't know whether we can attribute any of this to cloud-side factors (or other non-Kibana factors), but usage ramps up somewhat steadily over a few days, with spikes appearing consistently from April 14 onwards to the present day (currently May 12).

Interestingly, 8.1.2 (a month ago) was seen hitting at or over 800 MB for this test, which it hadn't before, and it's GA out in the wild. If we haven't seen a major performance impact since then, perhaps this is not as critical as it could seem? I would like others to evaluate the severity of this too - just offering that up.

Daniel M pointed me to this Node.js API: https://nodejs.org/dist/latest-v15.x/docs/api/v8.html#v8_v8_getheapsnapshot. But I'm not sure whether we have any easy way to put that into action or otherwise capture the heap; I'm out of my depth. @pmuellr @mikecote, do you want to (or can you) volunteer someone on the team to help us get basic debug info to push this onwards? I can do the legwork to get Kibana logs if we really want / need them... maybe enabling that in code in the cloud deployment (self-monitoring) is the way to go? Or setting a delayed start time in the test so we can set things up manually, as a hack.
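
(Added for illustration, not something currently wired into Kibana: a rough sketch of how that v8 API could be used to dump snapshots to disk from inside the server process. v8.writeHeapSnapshot() blocks the event loop and the files can be large, so this would only be suitable for a test deployment.)

```js
// Sketch: periodically write heap snapshots that can later be loaded into
// Chrome DevTools (Memory tab) and diffed. Blocks the event loop while
// writing, so for test/debug deployments only.
const v8 = require('v8');

const INTERVAL_MS = 60_000; // hypothetical: one snapshot per minute
let count = 0;

const timer = setInterval(() => {
  const file = v8.writeHeapSnapshot(`kibana-${Date.now()}.heapsnapshot`);
  console.log(`wrote heap snapshot #${++count}: ${file}`);
  if (count >= 3) clearInterval(timer); // e.g. before / during / after the spike
}, INTERVAL_MS);
timer.unref();
```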

Pictures: attached is build 45 / April 11, plus a zip of more...

april11-build-45

kib-perf-images.zip

pmuellr commented 2 years ago

Using heap snapshots is going to be hard. You need to arrange at least two, and usually three are better - before, during, and after the spike. And the analysis tools for heap snapshots are ... difficult to use. It's easier with a memory leak, where you really just need two snapshots. I'm not hopeful we could collect snapshots that would capture the spikes - or that, if we did, we'd be able to figure out what the spikes are from (again, the tooling for this is not great).

I know there are some more "live" heap profiling tools available, but I'm unfamiliar with them. For instance, the "sampling" APIs here - https://chromedevtools.github.io/devtools-protocol/v8/HeapProfiler. If this turned out to return interesting information, and not affect performance too much, we could arrange a Kibana config option to enable it and generate messages about "hot" things it has seen.
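
(For illustration only, not an existing Kibana option: Node exposes that same HeapProfiler domain through its built-in inspector module, so a sampling profile could be started in-process along these lines.)

```js
// Sketch: in-process sampling heap profiler via the DevTools protocol.
// Start sampling, let the workload run, then dump the allocation profile,
// which can be loaded into Chrome DevTools. Illustration only.
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

session.post('HeapProfiler.startSampling', { samplingInterval: 32768 }, (err) => {
  if (err) throw err;
  // Hypothetical: stop after 5 minutes and write the sampled profile to disk.
  setTimeout(() => {
    session.post('HeapProfiler.stopSampling', (stopErr, result) => {
      if (stopErr) throw stopErr;
      fs.writeFileSync('kibana-sampling.heapprofile', JSON.stringify(result.profile));
      session.disconnect();
    });
  }, 5 * 60 * 1000);
});
```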

EricDavisX commented 2 years ago

I was wondering what the difference in spike/usage is between this and the prior issue logged: https://github.com/elastic/kibana/issues/128061

Given Patrick's comments above, can someone from the Kibana team push this on? It doesn't seem related to changes in Response Ops code.

rudolf commented 2 years ago

It doesn't seem related to changes in Response Ops code.

I don't think we have any data that gives a hint as to the cause. But I can help investigate this; I would just need more detailed instructions to reproduce it.

on a fresh install, create 1000 rules via scripting with 50 percent of them firing actions and wait 2 mins to monitor memory

If using kbn-alert-load, what test suite should I run?

EricDavisX commented 2 years ago

@rudolf thanks so much. Below is the run string used by the Jenkins job that shows the concern: ./kbn-alert-load -M 15 --percentFiring 50 run stack-versions-1000 --region gcp-us-west2

It's a 15-minute test, with 50% of rules firing actions and 1000 rules created in total. It needs to run in gcp-us-west2 on prod to allow use of the latest snapshot, or you can modify the run as desired to hit staging. The details are documented in the repo, but feel free to ask any specifics you wish here or in Slack.

EricDavisX commented 2 years ago

Now that I've thought about this... I haven't used it, but I think the security team added the ability to run the cluster locally, which is what we'd want, right? Let's sync up offline.

rudolf commented 2 years ago

I tested this locally on macOS running Kibana with Node's --trace-gc flag (so we don't miss the real memory consumption spikes due to sampling) and running kbn-alert-load with:

./kbn-alert-load.js -M 15 --percentFiring 50 run stack-versions-1000 -e http://elastic:changeme@localhost:9200 -k http://elastic:changeme@localhost:5601

I noticed that when I start Kibana and kbn-alert-load at the same time, memory usage spikes quite high, but when I run kbn-alert-load a second time against an already running Kibana server, memory stays really low. I then just restarted the Kibana server and, even without kbn-alert-load, memory spiked high just after Core started all plugins. The amount of memory used varied significantly between runs, but it was roughly the same between 8.2 and main.

Looking at the GC logs, I also noticed that sometimes a single Mark-sweep would reduce the heap used by 335MB:

[23128:0x130008000]     9131 ms: Mark-sweep 685.2 (724.0) -> 350.2 (658.7) MB, 18.2 / 3.8 ms  (+ 25.2 ms in 274 steps since start of marking, biggest step 5.2 ms, walltime since start of marking 118 ms) (average mu = 0.991, current mu = 0.991) finalize incremental marking via stack guard GC in old space requested

This made me suspect the GC has the luxury of just not trying very hard to release memory. On both 8.2 and main I was able to run Kibana with kbn-alert-load using just 500MB of heap.

So in conclusion, I don't think there's a significant difference in the memory consumption between 8.2 and main. To get more accurate metrics for our performance tests, we should collect the memory consumption after a GC.
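
(Not from the thread: one way the "measure after a GC" idea could look in practice, as a sketch using Node's perf_hooks GC observer rather than anything that exists in kbn-alert-load today.)

```js
// Sketch: record heap usage right after each major GC, so the reported metric
// reflects live memory rather than garbage the collector hasn't reclaimed yet.
const { PerformanceObserver, constants } = require('perf_hooks');

const samples = [];

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Node >= 16 puts the GC kind on entry.detail; older versions on entry.kind.
    const kind = entry.detail ? entry.detail.kind : entry.kind;
    if (kind === constants.NODE_PERFORMANCE_GC_MAJOR) {
      const { heapUsed, rss } = process.memoryUsage();
      samples.push({ t: Date.now(), heapUsedMB: heapUsed / 1e6, rssMB: rss / 1e6 });
    }
  }
});
obs.observe({ entryTypes: ['gc'] });

// Periodically report the post-GC peak instead of the raw RSS peak.
setInterval(() => {
  if (samples.length > 0) {
    const maxHeap = Math.max(...samples.map((s) => s.heapUsedMB));
    console.log(`max post-GC heapUsed: ${maxHeap.toFixed(1)} MB over ${samples.length} major GCs`);
  }
}, 60_000).unref();
```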