elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Groovy scripting - possible memory leak ES 2.3.3 #19396

Closed bogdanovich closed 8 years ago

bogdanovich commented 8 years ago

Elasticsearch version: 2.3.3

JVM version: java version "1.8.0_40" Java(TM) SE Runtime Environment (build 1.8.0_40-b25) Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)

OS version: centos 6.3 (x86_64) :: linux 2.6.32-279.el6.x86_64

Description of the problem including expected versus actual behavior:

We upgraded our cluster from 1.4.4 to 2.3.3 and are now experiencing an issue where nodes become unresponsive and are heavily garbage collected. A node restart helps for about 24 hours, until the 24 GB of RAM is fully utilized again.

Provide logs (if relevant):

Running jmap -histo:live on the node with 91.1% heap used shows that 4.5 GB is taken by org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry:

num     #instances         #bytes  class name
----------------------------------------------
  1:      86241328     4829514368  org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry
  2:      66209193     1922638888  [Ljava.lang.Object;
  3:      34089692     1636305216  java.lang.invoke.MethodHandleImpl$CountingWrapper
  4:      62450344     1498808256  org.codehaus.groovy.util.FastArray
  5:      33948832     1086362624  java.util.HashMap$Node
  6:        223427     1063930664  [B
  7:       6938969      999212432  [Lorg.codehaus.groovy.util.ComplexKeyHashMap$Entry;
  8:      27089896      866876672  java.lang.invoke.BoundMethodHandle$Species_LL
  9:      17686899      707475960  java.lang.invoke.BoundMethodHandle$Species_L3
 10:       5818726      612509464  [C
 11:         23980      604964856  [I
 12:      16356260      523400320  org.codehaus.groovy.util.SingleKeyHashMap$Entry
 13:        495640      515466624  [Lorg.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry;
 14:       7202607      460966848  java.lang.invoke.BoundMethodHandle$Species_L4IL4
 15:       2020760      383997512  [Ljava.util.HashMap$Node;
 16:       7202607      345725136  java.lang.invoke.BoundMethodHandle$Species_L5
 17:       3003762      264331056  java.lang.reflect.Method
 18:       7884148      252292736  java.lang.invoke.BoundMethodHandle$Species_L
 19:       7434592      237906944  groovy.lang.MetaBeanProperty
 20:       4553471      218566608  java.util.HashMap
 21:       2508224      200657920  java.lang.reflect.Constructor
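
For reference, a histogram like the one above can be captured with a command along these lines (the placeholder stands for the Elasticsearch process id; the :live option forces a full GC before counting):

jmap -histo:live <es-pid> | head -n 25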
dakrone commented 8 years ago

I believe this may be related to https://github.com/elastic/elasticsearch/issues/18572 , which is resolved in 2.4.0 by https://github.com/elastic/elasticsearch/pull/18975

rmuir commented 8 years ago

It's more than that, though. Don't compile a new unique script with each request! Use params!

rmuir commented 8 years ago

Users that do this will continue to hit problems, including the JVM crapping out with stuff like https://bugs.openjdk.java.net/browse/JDK-8023191

Seriously, the root cause is that the scripting API encourages users to send a new unique script with each request. One symptom was that ES leaked the tons of generated classes. Another symptom will be that the JDK has a code cache leak, not fixed until Java 9. Other symptoms will be slow performance, due to compiler thread pressure and other problems.

But we should address the root cause: there needs to be documentation, limits, something to prevent the anti-pattern.

jasontedor commented 8 years ago

Relates #8632

bogdanovich commented 8 years ago

Would the CMSClassUnloadingEnabled flag temporarily fix the problem?

rmuir commented 8 years ago

No.

rmuir commented 8 years ago

The way to prevent the problem, again, is not to send a new unique script with each request. Instead, let the script take parameters (such as my_var) for whatever parts change between requests, and change those parameters via params. See https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html; there are examples of doing this.

That means your script will only be compiled once, and things will be much faster.
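
To make that concrete, here is a minimal sketch of both patterns in the 2.x query DSL (the index name, field name, and factor value are invented for illustration):

# Anti-pattern: the constant is baked into the script source, so every
# different value compiles and caches a brand-new script.
GET /my_index/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "inline": "doc['popularity'].value * 1.2",
          "lang": "groovy"
        }
      }
    }
  }
}

# Better: one script source, compiled once; only the parameter value changes.
GET /my_index/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "inline": "doc['popularity'].value * factor",
          "lang": "groovy",
          "params": { "factor": 1.2 }
        }
      }
    }
  }
}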

bogdanovich commented 8 years ago

@rmuir Why, then, was it working fine for months on ES 1.4.4?

PS: Thanks for clarifying the issue. We do indeed use unique scripts on every request and will fix that.

rmuir commented 8 years ago

Who knows; perhaps ES 1.4.4 didn't leak? Perhaps it leaked more slowly, because it used Groovy without invokedynamic. Not particularly interesting to me.

CMSClassUnloadingEnabled is not relevant: you are using Java 8, so it's already on by default anyway. But CMS never stands a chance to collect the classes, because of https://github.com/elastic/elasticsearch/issues/18572

But forget all about that too, because it's going to cause all kinds of other problems for your JVM (see my links and comments above), besides being horribly slow.

That's why I'd rather we not pass around a bunch of aspirin treating symptoms, but address the root cause, which is that it's too easy to compile a brand-new, slightly different script with each request. Obviously the API invites this (that's what happens when an API's efficiency relies on caching!), but that is hard to fix, so for now at least we should try to add some kind of system limit plus documentation encouraging people to do it the fast way (using params).

clintongormley commented 8 years ago

The docs for scripting have already been improved (see the Prefer parameters block on https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-using.html)

We report the number of compilations and cache evictions in the node stats (GET _nodes/stats/script).
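
For example, the relevant part of the response looks roughly like this (node id and numbers are invented):

GET _nodes/stats/script

{
  "nodes": {
    "node-id-here": {
      "script": {
        "compilations": 24531,
        "cache_evictions": 24431
      }
    }
  }
}

A compilations count that climbs roughly in step with cache evictions is a strong hint of the new-script-per-request pattern.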

The only other thing I can think of doing is to add a safeguard which limits the number of compilations per minute to, e.g., 10.

Does this sound reasonable?

rmuir commented 8 years ago

Can we make the Prefer parameters block a linkable thing? This would help, I think, to communicate the issue more easily and directly when it happens. The text is great.

As far as a compilations safeguard, I have no idea how it should work, but I do think we need something, at least to catch the most extreme cases before the JVM goes belly-up.

clintongormley commented 8 years ago

Can we make the Prefer parameters block a linkable thing? This would help, I think, to communicate the issue more easily and directly when it happens. The text is great.

Success! https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-using.html#prefer-params

As far as a compilations safeguard, I have no idea how it should work, but I do think we need something, at least to catch the most extreme cases before the JVM goes belly-up.

Is it the frequency of compilations or just the total number of compilations causing the problem here? If the former, then we could add some rate-limiting. If the latter, then the only way of resolving the issue would be to restart a node (which users have to do at the moment anyway). Should we just be deferring this until 5.0 and Painless? We're not going to make a breaking change in 2.x now.

rmuir commented 8 years ago

Thanks for making that link, I will use it.

Is it the frequency of compilations or just the total number of compilations causing the problem here? If the former, then we could add some rate-limiting. If the latter, then the only way of resolving the issue would be to restart a node (which users have to do at the moment anyway). Should we just be deferring this until 5.0 and Painless? We're not going to make a breaking change in 2.x now.

Absolutely we should defer; we don't need to rush anything in. I just want us to think about addressing it.

Mainly, I think we want to catch the anti-pattern of a brand-new script per request versus using params. It doesn't have to be a hard limit; it could be a warning in the logs as a start. I mean, we should just try to do something. I think the way the caching works is very unintuitive.

In some cases, e.g. expressions, doing this might not be such a problem. Expressions generates a tiny method without loops and probably will never cause an issue, even with the Java bugs. Expressions was actually discussed around the bug in question:

No loops means no OSR compilation of that bytecode, but if it's called millions of times (based on what you said above) then it'll get compiled. It may be that your bytecode generates really small nmethods (you mentioned it's simple algebraic expressions) and their generation naturally doesn't outpace code cache cleaning.

On the other hand, Groovy is more complicated and heavy-duty. Features like Groovy closures call out to reflection in slow ways. You can see some of that in the histogram on this issue.

Painless? We don't know yet; someone should benchmark it. It uses Java's lambda support, which is faster and more lightweight than Groovy closures. Maybe it's OK to compile jazillions of scripts; maybe it's not.

But even if it's "OK", it does not give the user good performance. It is always faster to avoid doing this stuff on every request.

dakrone commented 8 years ago

As far as a compilations safeguard, I have no idea how it should work, but I do think we need something, at least to catch the most extreme cases before the JVM goes belly-up.

I have an idea for a "rate-limiting" circuit breaker, where we could limit script compilation to something like 10 a minute (or whatever limit) and throw a circuit-breaking exception with a helpful message when the limit is exceeded. Maybe that would help in this situation?

bogdanovich commented 8 years ago

@dakrone Someone could have 100 different scripts used in their app, and all of them could be compiled during the first minute the app is running. That would cause a circuit breaker exception if it were implemented the way you are suggesting.

Maybe a requests-to-compilations ratio could be helpful.

dakrone commented 8 years ago

Someone could have 100 different scripts used in their app, and all of them could be compiled during the first minute the app is running. That would cause a circuit breaker exception if it were implemented the way you are suggesting.

It might be possible to thread a flag through for dynamic scripts versus disk-based scripts, so that disk-based scripts are exempt.

Still, 100 different scripts seems like a lot! Once a user gets to that level they may want to restructure their data so they can avoid using scripting so much.
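
For reference, a disk-based script in 2.x is just a file under config/scripts that is referenced by name, so it is compiled when loaded rather than per request. A rough sketch (script name, field, and parameter are invented):

# config/scripts/popularity_boost.groovy contains:
#   doc['popularity'].value * factor

GET /my_index/_search
{
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "file": "popularity_boost",
          "params": { "factor": 1.2 }
        }
      }
    }
  }
}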

clintongormley commented 8 years ago

Proposal: for inline scripts which have no params specified, apply a unique-compilations-per-minute limit (e.g. 10 per minute). If the limit is breached, delay compilation by e.g. 1 second and return a Warning: header with the response.

rmuir commented 8 years ago

Wouldn't that just make things slower?

clintongormley commented 8 years ago

Yes, that's the intention here: to provide an incentive for users not to hardcode parameter values (along with informing them why their query is slow). The alternative is simply to throw an exception if the limit is breached. Either way, limiting it to inline scripts without a params clause (even if no params are needed) limits the impact of this change, I think.

rmuir commented 8 years ago

Well, my opinion: I want things to be fast, hence my raising this whole discussion :)

I'm concerned that this warning will not be noticed (which is ok), but then we will make things even slower.

A lot of ES scripting is pretty damn slow; I'm not sure a 1-second delay is going to grab anyone's attention in the way we want.

I think it's OK to just log a warning in the system logs and send your Warning: header too? Just as a start: nothing risky, we are just trying to grab the user's attention.

clintongormley commented 8 years ago

Fair enough. I have high hopes for these Warning: headers... just need to get them exposed in the clients.

dakrone commented 8 years ago

I agree with @rmuir; I think delaying and returning a warning are likely to be ignored. I'm in favor of a hard limit: throw a helpful exception when it's exceeded that forces a user to take action, whether that action is increasing the breaker limit or switching the way that scripts are sent.

s1monw commented 8 years ago

@dakrone are you working on this? I removed the discuss label and added adoptme; feel free to change or assign.

dakrone commented 8 years ago

@s1monw I'm currently working on the breaker for buckets and the translog tool, but I'm planning to work on this after those are done. I'll leave this in "adoptme" until then in case someone wants to work on it before I get to it though.

nik9000 commented 8 years ago

I think this was solved in #19694 (targeting 5.0). It creates a breaker that'll trip and fail the request quickly if you attempt to compile too many inline scripts. The steady-state limit is 15 scripts a minute. It'll fail if you compile 16 scripts really, really fast, but after four seconds it'll let another one through, then another after four seconds, etc. It just won't ever let more than 15 through at a time, no matter how many you've "saved up" by not compiling a script every four seconds.
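
That behavior is essentially a token bucket. This is not the code from #19694, just a rough Java sketch of the idea as described above (names and structure are mine, not Elasticsearch's):

import java.util.concurrent.TimeUnit;

public class CompilationRateLimiter {
    private static final int MAX_TOKENS = 15;                              // steady-state limit per minute
    private static final long REFILL_NANOS = TimeUnit.SECONDS.toNanos(4);  // one new token every 4 seconds

    private double tokens = MAX_TOKENS;
    private long lastRefill = System.nanoTime();

    // Returns true if a new inline script may be compiled right now.
    public synchronized boolean tryCompile() {
        long now = System.nanoTime();
        // Refill one token per 4 seconds elapsed, capped at 15, so unused
        // "slots" cannot be saved up beyond a single burst.
        tokens = Math.min(MAX_TOKENS, tokens + (now - lastRefill) / (double) REFILL_NANOS);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;   // go ahead and compile
        }
        return false;      // caller trips the breaker / fails the request
    }
}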

Anyway, that should make issues like this super obvious.