elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Fix unnecessary cache evictions when setting indices.fielddata.cache.size #7836

Closed craigwi closed 9 years ago

craigwi commented 10 years ago

As described in user group posts including https://groups.google.com/d/msgid/elasticsearch/07cfb705-1f13-40d5-b7b2-c5b84e328ddb%40googlegroups.com and https://groups.google.com/d/msgid/elasticsearch/14703918-0450-42c1-810b-edf3967951b6%40googlegroups.com

Guava needs some changes to make better use of RAM when using eviction by size (e.g., when indices.fielddata.cache.size is set in ES).

Proposed changes to Guava are here https://code.google.com/r/craigwi-guava/.

Update: also opened Guava issue: https://code.google.com/p/guava-libraries/issues/detail?id=1854.
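For background, here is a minimal sketch of how a size-bounded Guava cache is typically configured (names and numbers are illustrative, not the actual IndicesFieldDataCache code); the per-segment weight limit it implies is what the rest of this thread keeps coming back to:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;

// Illustrative sketch only -- not the actual IndicesFieldDataCache code.
final class SizeBasedCacheSketch {
    static Cache<String, byte[]> build(long maxBytes) {
        Weigher<String, byte[]> weigher = (key, value) -> value.length;
        // Guava splits maxBytes across its internal segments (16 by default),
        // so each segment only enforces roughly maxBytes / 16.
        return CacheBuilder.newBuilder()
                .maximumWeight(maxBytes)   // e.g. indices.fielddata.cache.size: 2gb
                .weigher(weigher)
                .concurrencyLevel(16)      // Guava's default segment count
                .build();
    }
}
```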

clintongormley commented 10 years ago

@mikemccand any ideas about this?

mikemccand commented 10 years ago

Sorry @clintongormley I don't have any insights here.

kimchy commented 10 years ago

I looked at it, and I suggest we wait and see how things progress on the Guava front before addressing it. I agree with the note that contention on a single AtomicLong can be tricky, though it might not be that tricky for non-write-heavy cases. The global eviction queue suggested there is interesting...

kimchy commented 10 years ago

As a side note, we could try using a murmur hash (HPPC has a nice 32-bit one) on the hashCode in the IndicesFieldDataCache#Key class to get a better hash distribution. @jpountz / @s1monw what do you think? (If we do this, it might make sense to do it for other Guava cache entries too.)
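A rough sketch of that idea, assuming hypothetical Key fields (the real IndicesFieldDataCache#Key has a different shape); the mixer below is the standard 32-bit MurmurHash3 finalizer, which is along the lines of what HPPC provides:

```java
// Hypothetical sketch; the real IndicesFieldDataCache#Key has different fields.
final class Key {
    final Object readerKey;
    final String fieldName;

    Key(Object readerKey, String fieldName) {
        this.readerKey = readerKey;
        this.fieldName = fieldName;
    }

    // Standard MurmurHash3 fmix32 finalizer, used here to spread the bits
    // before Guava picks a segment from the high-order bits of the hash.
    static int mix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    @Override
    public int hashCode() {
        return mix32(31 * readerKey.hashCode() + fieldName.hashCode());
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Key)) return false;
        Key other = (Key) o;
        return readerKey.equals(other.readerKey) && fieldName.equals(other.fieldName);
    }
}
```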

plaflamme commented 10 years ago

Would it be possible to allow configuring the number of partitions ES uses? Reducing the number of partitions would be safer than increasing the maximum size of the cache to patch this until guava has an actual solution.

kimchy commented 10 years ago

@plaflamme yes, we can add it as a configurable option, though I don't like that it's such an expert setting.

craigwi commented 10 years ago

The issue arises because of the combination of the hash function distribution AND the distribution of the sizes of the elements -- I'm not sure this is solvable with just a new hash function (which might be valuable for other reasons ...).

craigwi commented 10 years ago

I thought of another potential mitigation in case Guava doesn't take a change like the one I'm proposing. If the variation in the size of elements were reduced, cache memory utilization should go up. In the case I debugged, the item causing the problem was loaded in ValuesSource.load via the call to indexFieldData.load. The items loaded in my case were in the 100mb range and the cache segments were only 134mb. Due to the hash keys, two of these large items would wind up in one cache segment, and given the sizes, only one could remain.

If there were some way to break up that structure into smaller pieces (and of course not slow other things down...), that should increase the cache memory utilization.
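To make the arithmetic concrete (the per-segment figure comes from the comment above; the total budget is assumed from Guava's default of 16 segments):

```java
// Back-of-the-envelope illustration of the eviction problem described above.
public class SegmentBudget {
    public static void main(String[] args) {
        long segmentBudget = 134L * 1024 * 1024; // ~134mb per segment, as observed
        long entryBytes    = 100L * 1024 * 1024; // ~100mb per field-data entry
        long totalBudget   = 16 * segmentBudget; // ~2.1gb overall budget (assumed)

        // Two large entries hashing into the same segment need ~200mb, but the
        // segment only holds ~134mb, so one is evicted even though the cache
        // as a whole is nowhere near its ~2.1gb limit.
        System.out.println(2 * entryBytes > segmentBudget); // true
        System.out.println(2 * entryBytes < totalBudget);   // true
    }
}
```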

nik9000 commented 10 years ago

> The items loaded in my case were in the 100mb range and the cache segments were only 134mb.

I'm not close to the situation but does it make sense to cache things that take up so much of the cache?

It feels silly, but maybe use a single-segment cache for things over some size and the multi-segment cache for smaller things? It just feels like things that large can't have much contention on them, because there isn't enough heap to hold many of them.

jpountz commented 10 years ago

@kimchy Guava already applies a secondary hash to spread the hashCode() bits in LocalCache.rehash so I don't think we need to do anything more on our side.
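For reference, the spreader in question looks roughly like this (quoted from memory of Guava's LocalCache at the time, so treat it as approximate; it is the same Wang/Jenkins-style mix used by pre-JDK-8 ConcurrentHashMap):

```java
// LocalCache.rehash, approximately as it appeared in Guava around this era.
static int rehash(int h) {
    h += (h << 15) ^ 0xffffcd7d;
    h ^= (h >>> 10);
    h += (h << 3);
    h ^= (h >>> 6);
    h += (h << 2) + (h << 14);
    return h ^ (h >>> 16);
}
```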

s1monw commented 10 years ago

@craigwi we can expose the concurrency level as a setting if that would help until we have a better cache impl that doesn't suffer from this problem?

plaflamme commented 10 years ago

@s1monw That would definitely help in the interim. Sadly, Guava doesn't expose write contention metrics; they would have been useful for measuring the impact of changing this setting.

s1monw commented 10 years ago

I opened a PR to add this setting to all caches we have
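A minimal sketch of what wiring such a setting into the cache builder could look like; the setting key and default below are illustrative, not necessarily what the PR actually uses:

```java
import com.google.common.cache.CacheBuilder;
import org.elasticsearch.common.settings.Settings;

// Hypothetical helper -- the real PR may use a different key and plumbing.
final class CacheConcurrencySketch {
    static CacheBuilder<Object, Object> builderFor(Settings settings) {
        CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder();
        // Expert option: override Guava's default of 16 segments.
        int concurrencyLevel = settings.getAsInt("indices.fielddata.cache.concurrency_level", 16);
        if (concurrencyLevel > 0) {
            builder.concurrencyLevel(concurrencyLevel);
        }
        return builder;
    }
}
```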

craigwi commented 10 years ago

@plaflamme, I agree that the write contention on the field data cache should be measured to determine any impact from reducing the concurrency level.

@s1monw, regarding exposing a setting to control this, I do NOT recommend this. I sent a note to @kimchy through ZenDesk to this effect. I have suggested to Shay that, after verifying the actual write contention, you set the concurrency level to 1 automatically on the field data cache (either always or only when size-based eviction is used).

Related: I'd like to understand why the default in ES for the field data cache is NO eviction policy. It would seem that if size-based eviction worked correctly, it would be a better default.

plaflamme commented 10 years ago

I agree that there's no reason to change this value unless you're using size-based eviction, in which case you'll probably set it to 1 so that your cap setting is actually respected.

Also, I suspect that the gain of NOT eagerly evicting field data cache entries will outweigh any cost of additional write contention: doing disk IO will always be more expensive than doing more locking (unless the IO happens under the lock).

So setting this to 1 by default when size-based eviction is used would make sense.

Alternatively, you could set it to 2 and overcommit each partition by a small factor. That would already cut write contention by half (assuming the keys are uniformly distributed).

s1monw commented 10 years ago

I think we have two distinct problems here, or at least two that I am trying to solve:

  1. make it possible to work around potential issues with the defaults by setting the concurrency level, which can help fix the problem until we have a good / better default
  2. find a good default value, or even a better default implementation, for this cache

The latter will likely not make it into 1.4 GA, so I think we should go with option 1 as an undocumented option, fix it properly in upcoming releases, and even think about an alternative cache impl that doesn't use segment-based locking. To me the field data cache is more or less single-writer, multiple-reader, or at least that is what we want to optimize for. The cache we use is pretty much a default cache designed to be sufficient for write-intensive applications, which we are not. We might be able to come up with something better than working around stuff here?

craigwi commented 10 years ago

Thanks @s1monw. In the short term, I agree with the proposal to enable us to set the concurrency level.

The longer-term question seems to be a choice between a circuit-breaker approach, which necessitates taking proactive steps as soon as one notices the breaker has tripped (since customers are likely already impacted), OR something else that exhibits a slow degradation in performance, which also has to be noticed but gives one more time to react.

The current defaults fit the first approach, of course, and a proper size-based eviction fielddata cache would fit the latter approach. There must be other ideas...

s1monw commented 10 years ago

> The longer-term question seems to be a choice between a circuit-breaker approach, which necessitates taking proactive steps as soon as one notices the breaker has tripped (since customers are likely already impacted), OR something else that exhibits a slow degradation in performance, which also has to be noticed but gives one more time to react.

To be honest, size-based evictions often result in cluster meltdown: once you hit the upper limit you keep loading and unloading field data, which causes large amounts of memory to be allocated and freed, and you run into GCs due to the evictions (which are async, btw). To react to situations where you are really close to the limits, I think rejecting requests for health reasons is the way to go; clients need to deal with this. IMO the cluster health is way more important at that point.

> The current defaults fit the first approach, of course, and a proper size-based eviction fielddata cache would fit the latter approach. There must be other ideas...

I think we need to fix this caveat and maybe go with a default concurrency of 1 for field data if size-based eviction is used. I think that is the right fix for now. To be honest, I doubt there will be a big perf impact, but we would need to verify; from looking at the code, the read operations are most likely not under a lock, so contention is only likely to happen in the write case, which I don't think will be very problematic here.

craigwi commented 10 years ago

> To be honest, size-based evictions often result in cluster meltdown

If this is true, of course I would prefer orderly failures to meltdown. We experienced all sorts of problems before we made OOM kill java.exe and before we switched to the G1 GC. Now we see the gradual slowdown when evictions start, as I wrote.

Regarding https://github.com/elasticsearch/elasticsearch/pull/8112, I would like to see at least the fielddata cache part of that in a 1.3.x release -- unless 1.4 is imminent.

s1monw commented 10 years ago

Hey @craigwi, I don't think this really counts as a bugfix, so I won't backport it to 1.3.x. Given that 1.4 is in beta and we are moving towards GA soonish, it is imminent.

craigwi commented 10 years ago

Thanks @s1monw. The fact that ES doesn't behave as documented would seem to me to be a bug (cf. https://github.com/elasticsearch/elasticsearch/issues/7841), but I recognize that backporting is not free or even cheap.

Will this show up in a 1.4 beta release? Which one? I really do want to verify the behavior of the new option ASAP.

clintongormley commented 9 years ago

We need to investigate other cache options

clintongormley commented 9 years ago

Guava has been removed. Closing