elastic / kibana


Kibana RSS memory spike seen in performance test #131950

Closed · EricDavisX closed this issue 2 years ago

EricDavisX commented 2 years ago

Cluster version & info: 8.3.0-SNAPSHOT is where the memory usage spike is seen. Note: the spike was present before, but was not as severe until about 25 days ago (mid-April). The test is run against the CFT (us-west2) region of GCP.

Original install method (e.g. download page, yum, from source, etc.): The kbn-alert-load test runs multiple versions in cloud and lets the team visually compare them. See the screenshot below.

Describe the bug: Memory spikes early in the test (about 2 minutes in), after all rules have been created and the processing of all 1000 of them is in full force.

Steps to reproduce:

  1. On a fresh install, create 1000 rules via scripting, with 50 percent of them firing actions, then wait 2 minutes while monitoring memory.

The kbn-alert-load tool can be used to help test / validate this, though not easily against branch PR code. A rough sketch of what the scripted rule creation amounts to is included below for reference.
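
Not part of the original report, but roughly what step 1 amounts to, as a minimal sketch assuming Node 18+ (for global fetch), the standard Kibana alerting HTTP API (POST /api/alerting/rule), an .index-threshold rule type, and a pre-existing server-log connector. The names, credentials, and data index below are placeholders, and kbn-alert-load itself does this differently:

```js
// Minimal sketch (not kbn-alert-load itself): create 1000 rules via the
// Kibana alerting HTTP API, wiring ~50% of them to an action and tuning the
// threshold so those rules actually fire. All ids/credentials are placeholders.
const KIBANA_URL = 'http://localhost:5601';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64');

async function createRule(i, firing) {
  const res = await fetch(`${KIBANA_URL}/api/alerting/rule`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'kbn-xsrf': 'true', Authorization: AUTH },
    body: JSON.stringify({
      name: `perf-rule-${i}`,
      rule_type_id: '.index-threshold',
      consumer: 'alerts',
      schedule: { interval: '1m' },
      notify_when: 'onActiveAlert',
      params: {
        index: ['perf-test-data'],      // hypothetical data index
        timeField: '@timestamp',
        aggType: 'count',
        groupBy: 'all',
        timeWindowSize: 5,
        timeWindowUnit: 'm',
        thresholdComparator: '>',
        threshold: [firing ? 0 : 1e9],  // firing rules always exceed the threshold
      },
      actions: firing
        ? [{ group: 'threshold met', id: 'server-log-connector-id', params: { message: 'fired' } }]
        : [],
    }),
  });
  if (!res.ok) throw new Error(`rule ${i} failed: ${res.status} ${await res.text()}`);
}

(async () => {
  for (let i = 0; i < 1000; i++) await createRule(i, i % 2 === 0);
  console.log('created 1000 rules; monitor Kibana RSS/heap for ~2 minutes');
})();
```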

Expected behavior: The memory spikes, but only up to around 800 million bytes, which is the level we saw in prior runs. It is worth noting that the default cloud memory cap for Kibana is right at 800 million bytes.

Screenshots (if relevant): kib-rss-mem-spike

Errors in browser console (if relevant): n/a

Provide logs and/or server output (if relevant): the test runs in CFT, so we don't have Kibana logs yet. We can tweak the execution to run longer and then inspect (and capture) logs live; help is appreciated - ping the Response Ops team as needed.

Any additional context: In Slack, Jonathan Budzenski asked for this issue to be filed and for all 3 teams to be pinged.

elasticmachine commented 2 years ago

Pinging @elastic/kibana-core (Team:Core)

elasticmachine commented 2 years ago

Pinging @elastic/kibana-operations (Team:Operations)

elasticmachine commented 2 years ago

Pinging @elastic/response-ops (Team:ResponseOps)

rudolf commented 2 years ago

Do we have historical records so we can pinpoint which build between 55 and 83 introduced the regression? That would make it significantly easier to identify potential changes that could have caused it.

I would suggest narrowing it down as much as possible and then taking a heap snapshot on both versions, e.g. 1 minute into the test. Chrome's debugging tools allow you to compare two heap snapshots, which would help us identify what's filling up the memory with the extra ~200MB and, hopefully, why that started happening.

rudolf commented 2 years ago

The test is run against CFT (us-west2) region of GCP.

How much heap was Kibana given in these tests (--max-old-space-size)?
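
(Not part of the original thread: if it helps answer this, one quick way to confirm the effective limit is to read V8's heap statistics from a process started with the same flags as the Kibana deployment; a minimal sketch follows.)

```js
// Sketch: print the V8 heap limit the process was started with.
// Run with the same flags as the Kibana deployment, e.g.:
//   node --max-old-space-size=800 heap-limit.js
const v8 = require('v8');

const { heap_size_limit } = v8.getHeapStatistics();
console.log(`heap_size_limit: ${(heap_size_limit / 1024 / 1024).toFixed(0)} MB`);
```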

EricDavisX commented 2 years ago

I have dug in more thoroughly and can confirm the problem actually started earlier than I first reported, sorry:

So, you can interpret that as you see fit. I'd start looking around April 11 for Kibana check-ins. I don't know whether we can attribute any of this to cloud-side factors (or other non-Kibana factors), but usage ramps up somewhat steadily over a few days, with spikes appearing consistently from April 14 onwards to the present day (currently May 12).

Interestingly, 8.1.2 (a month ago) was seen hitting at or over 800 MB for this test, which it hadn't before, and it's GA out in the wild. If we haven't seen a major performance impact since then, perhaps this is not as critical as it could seem? I would like others to evaluate the severity of this too - just offering that up.

Daniel M pointed me to this Node.js API: https://nodejs.org/dist/latest-v15.x/docs/api/v8.html#v8_v8_getheapsnapshot. But I'm not sure whether we have any easy way to put that into action or otherwise capture the heap; I'm out of my depth. @pmuellr @mikecote, do you want to (or can you) volunteer someone on the team to help us get basic debug info to push this onwards? I can do the legwork to get Kibana logs if we really want / need them... maybe enabling that in code in the cloud deployment (self-monitoring) is the way to go? Or setting a delayed start time in the test so we can set things up manually, as a hack.
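
(Added for illustration, not something currently wired into Kibana: a rough sketch of how that v8 API could be used to dump snapshots to disk from inside the server process. v8.writeHeapSnapshot() blocks the event loop and the files can be large, so this would only be suitable for a test deployment.)

```js
// Sketch: periodically write heap snapshots that can later be loaded into
// Chrome DevTools (Memory tab) and diffed. Blocks the event loop while
// writing, so for test/debug deployments only.
const v8 = require('v8');

const INTERVAL_MS = 60_000; // hypothetical: one snapshot per minute
let count = 0;

const timer = setInterval(() => {
  const file = v8.writeHeapSnapshot(`kibana-${Date.now()}.heapsnapshot`);
  console.log(`wrote heap snapshot #${++count}: ${file}`);
  if (count >= 3) clearInterval(timer); // e.g. before / during / after the spike
}, INTERVAL_MS);
timer.unref();
```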

Pictures: attached is build 45 / April 11, plus a zip of more...

april11-build-45

kib-perf-images.zip

pmuellr commented 2 years ago

Using heap snapshots is going to be hard. You need to arrange at least two, and usually three are better - before, during, and after the spike. And the analysis tools for heap snapshots are ... difficult to use. It's easier with a memory leak, where you really just need two snapshots. I'm not hopeful we could collect snapshots that would capture the spikes - or that, if we did, we'd be able to figure out what the spikes are from (again, the tooling for this is not great).

I know there are some more "live" heap profiling tools available, but I'm unfamiliar with them. For instance, the "sampling" APIs here - https://chromedevtools.github.io/devtools-protocol/v8/HeapProfiler. If this turned out to return interesting information, and not affect performance too much, we could arrange a Kibana config option to enable it and generate messages about "hot" things it has seen.
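
(For illustration only, not an existing Kibana option: Node exposes that same HeapProfiler domain through its built-in inspector module, so a sampling profile could be started in-process along these lines.)

```js
// Sketch: in-process sampling heap profiler via the DevTools protocol.
// Start sampling, let the workload run, then dump the allocation profile,
// which can be loaded into Chrome DevTools. Illustration only.
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

session.post('HeapProfiler.startSampling', { samplingInterval: 32768 }, (err) => {
  if (err) throw err;
  // Hypothetical: stop after 5 minutes and write the sampled profile to disk.
  setTimeout(() => {
    session.post('HeapProfiler.stopSampling', (stopErr, result) => {
      if (stopErr) throw stopErr;
      fs.writeFileSync('kibana-sampling.heapprofile', JSON.stringify(result.profile));
      session.disconnect();
    });
  }, 5 * 60 * 1000);
});
```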

EricDavisX commented 2 years ago

I was wondering what the difference in spike/usage is between this and the prior issue logged: https://github.com/elastic/kibana/issues/128061

Given Patrick's comments above, can someone from the Kibana team push this on? It doesn't seem related to changes in Response Ops code.

rudolf commented 2 years ago

It doesn't seem related to changes in Response Ops code.

I don't think we have any data that gives a hint as to the cause. But I can help investigate this; I would just need more detailed instructions to reproduce it.

on a fresh install, create 1000 rules via scripting with 50 percent of them firing actions and wait 2 mins to monitor memory

If using kbn-alert-load, what test suite should I run?

EricDavisX commented 2 years ago

@rudolf thanks so much. Below is the run string used by the Jenkins job that shows the concern: ./kbn-alert-load -M 15 --percentFiring 50 run stack-versions-1000 --region gcp-us-west2

It's a 15-minute test, with 50% of rules firing actions and 1000 rules created in total. It needs to run in gcp-us-west2 on prod to allow use of the latest snapshot, or you can modify the run as desired to hit staging. The details are documented in the repo, but feel free to ask any specifics you wish here or in Slack.

EricDavisX commented 2 years ago

Now that I've thought about this... I haven't used it, but I think the security team added the ability to run the cluster locally, which is what we'd want, right? Let's sync up offline.

rudolf commented 2 years ago

I tested this locally on macOS running Kibana with Node's --trace-gc flag (so we don't miss the real memory consumption spikes due to sampling) and running kbn-alert-load with:

./kbn-alert-load.js -M 15 --percentFiring 50 run stack-versions-1000 -e http://elastic:changeme@localhost:9200 -k http://elastic:changeme@localhost:5601

I noticed that when I start Kibana and kbn-alert-load at the same time, memory usage spikes quite high, but when I run kbn-alert-load a second time against an already running Kibana server, memory stays really low. I then just restarted the Kibana server and, even without kbn-alert-load, memory spiked high just after Core started all plugins. The amount of memory used varied significantly between runs, but it was roughly the same between 8.2 and main.

Looking at the GC logs, I also noticed that sometimes a single Mark-sweep would reduce the heap used by 335MB:

[23128:0x130008000]     9131 ms: Mark-sweep 685.2 (724.0) -> 350.2 (658.7) MB, 18.2 / 3.8 ms  (+ 25.2 ms in 274 steps since start of marking, biggest step 5.2 ms, walltime since start of marking 118 ms) (average mu = 0.991, current mu = 0.991) finalize incremental marking via stack guard GC in old space requested

This made me suspect the GC has the luxury of just not trying very hard to release memory. On both 8.2 and main I was able to run Kibana with kbn-alert-load using just 500MB of heap.

So in conclusion, I don't think there's a significant difference in the memory consumption between 8.2 and main. To get more accurate metrics for our performance tests, we should collect the memory consumption after a GC.
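
(Not from the thread: one way the "measure after a GC" idea could look in practice, as a sketch using Node's perf_hooks GC observer rather than anything that exists in kbn-alert-load today.)

```js
// Sketch: record heap usage right after each major GC, so the reported metric
// reflects live memory rather than garbage the collector hasn't reclaimed yet.
const { PerformanceObserver, constants } = require('perf_hooks');

const samples = [];

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Node >= 16 puts the GC kind on entry.detail; older versions on entry.kind.
    const kind = entry.detail ? entry.detail.kind : entry.kind;
    if (kind === constants.NODE_PERFORMANCE_GC_MAJOR) {
      const { heapUsed, rss } = process.memoryUsage();
      samples.push({ t: Date.now(), heapUsedMB: heapUsed / 1e6, rssMB: rss / 1e6 });
    }
  }
});
obs.observe({ entryTypes: ['gc'] });

// Periodically report the post-GC peak instead of the raw RSS peak.
setInterval(() => {
  if (samples.length > 0) {
    const maxHeap = Math.max(...samples.map((s) => s.heapUsedMB));
    console.log(`max post-GC heapUsed: ${maxHeap.toFixed(1)} MB over ${samples.length} major GCs`);
  }
}, 60_000).unref();
```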