DJRickyB opened this issue 2 years ago
The profiles.zip archive includes three JFR recordings (each recording is a different Rally race):

- `profile-excluding.jfr`: the JFR recording stops the challenge just before running the date histogram operation
- `profile-including.jfr`: the JFR recording stops the challenge just after running the date histogram operation
- `profile-nocb.jfr`: the JFR recording stops the challenge just after running the date histogram operation, with the circuit breaker disabled
Humongous objects are particularly hard for G1GC. It typically has a reserved area for them, but given a large volume of humongous allocations (and a small heap) it will easily overflow this area and has no choice but to allocate in the old space. The old space is what we look at when figuring out how much free memory we have, so having all these (perhaps now dead) objects in the old space appears to us as if we cannot satisfy the request. G1 will fight fiercely not to do a global collection and clean up the old space as long as it can satisfy allocation requests with young-space collections alone. So this situation, with the old space full of dead objects, can persist for a while.
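For concreteness: on a 1 GB heap G1 typically ends up with 1 MB regions, so any single allocation over roughly 512 KB (an `int[]` of about 131,072 elements) is treated as humongous. A minimal sketch, not taken from the attached profiles, that produces such allocations:

```java
// Illustrative only: shows the humongous-allocation threshold under G1.
// On a 1 GB heap G1 typically uses 1 MB regions, so any single object larger
// than half a region (~512 KB) is allocated as humongous, outside the young gen.
// Run with e.g.: java -Xmx1g -Xlog:gc HumongousDemo.java
public class HumongousDemo {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000; i++) {
            // ~1 MB of int data per array: well over the ~512 KB humongous threshold.
            int[] buffer = new int[256 * 1024];
            buffer[buffer.length - 1] = i;
            sum += buffer[buffer.length - 1];
        }
        System.out.println(sum); // keep the allocations from being optimized away
    }
}
```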
With a low core count and heap size, OpenJDK will typically choose SerialGC by default instead of G1.
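This comes from HotSpot ergonomics: on a machine (or container) the JVM does not consider "server class" (roughly, fewer than two available CPUs or less than about 2 GB of memory), it selects SerialGC instead of G1. A small sketch to check which collectors were actually chosen (class name is illustrative):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints the garbage collectors the running JVM actually selected.
// Under G1 this typically prints "G1 Young Generation" / "G1 Old Generation";
// under SerialGC it typically prints "Copy" / "MarkSweepCompact".
public class WhichGc {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName());
        }
    }
}
```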
I'm not sure what the answer is here. I don't think Serial would be a good choice for us, but perhaps experimenting with ShenandoahGC for low heap sizes is worth a shot. It's the closest collector to the old CMS. Performance will definitely be worse compared to G1, but we may not hit the circuit breakers as often.
Another option is to call `System.gc()` when we are about to fail with the circuit breaker. That will definitely nudge G1 to perform a collection of the old space. With low heap sizes, calling `System.gc()` will not be deadly.
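A minimal sketch of that idea (the class, method names, and threshold below are illustrative, not Elasticsearch internals): before rejecting a reservation because the estimated usage is over the limit, trigger a collection and re-check, since the old-generation estimate may be stale.

```java
// Illustrative only: not Elasticsearch code, just the shape of the idea.
public class GcBeforeBreaking {

    // Pretend parent breaker limit: 95% of the maximum heap.
    static final long LIMIT_BYTES = (long) (Runtime.getRuntime().maxMemory() * 0.95);

    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Returns true if the reservation should still be rejected after a GC nudge. */
    static boolean shouldBreak(long requestedBytes) {
        if (usedHeap() + requestedBytes <= LIMIT_BYTES) {
            return false;
        }
        System.gc(); // nudge the collector to clean up dead objects in the old space
        return usedHeap() + requestedBytes > LIMIT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("break? " + shouldBreak(10L * 1024 * 1024));
    }
}
```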
> I don't think Serial would be a good choice for us

Can you elaborate on this?
> perhaps experimenting with ShenandoahGC for low heap sizes is worth a shot. It's the closest collector to the old CMS. Performance will definitely be worse compared to G1, but we may not hit the circuit breakers as often.

Do we think Shenandoah is ready for prime time? Also, these low-heap instances have smaller amounts of CPU given to them, so picking a GC that steals time from application threads might not be great. It is worth noting that on these instances with small heaps running G1GC we don't see a lot of OOMs, but we do hit the circuit breakers frequently and also log a lot of messages like `gc overhead, spent [3.9s] collecting in the last [4s]`.
> > I don't think Serial would be a good choice for us
>
> Can you elaborate on this?
The problem is that we try to estimate whether we'll go out of memory based on stale object-liveness statistics. In a sense, the old-area liveness information reported by the GC is always wrong (or out of date): it's only recalculated on full GCs, but until we do one we can't tell. SerialGC doesn't do a concurrent mark, which is our only hope of finding out what's actually live in the old area before a full GC runs. G1GC does do a concurrent mark, but it needs to be certain it's necessary before it will trigger one. We can experiment with G1ReservePercent and InitiatingHeapOccupancyPercent to let the G1 concurrent mark run early and have time to finish before the heap fills up on these small heap sizes.
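A jvm.options-style sketch of what that experiment might look like (the values below are placeholders to try, not settings taken from this thread):

```
## Placeholder values for experimentation only:
## start the G1 concurrent mark cycle earlier so it can finish before the heap fills up
-XX:InitiatingHeapOccupancyPercent=30
## keep a larger slice of the heap free as headroom for (humongous) allocations
-XX:G1ReservePercent=25
```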
ShenandoahGC is now included by default in JDK 17 and is no longer experimental, so it's ready for us to use. Shenandoah might work because of its pacer, which forces it to do a lot of preemptive GC work, marking and cleaning up much more often. The pacer taxes the application for allocating a lot, stealing application time to do GC work. I think with Shenandoah our heap-occupancy estimates would always be much closer to reality, but the collector overhead will be larger. Having more GC overhead may not be an issue, as long as we can process every transaction rather than rejecting some.
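For a low-heap experiment, switching a node over would just mean replacing the G1 settings with the Shenandoah flag (again illustrative, not a recommendation from this thread):

```
## Illustrative only: enables Shenandoah in place of the default G1 settings
-XX:+UseShenandoahGC
```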
> It is worth noting that on these instances with small heaps running G1GC we don't see a lot of OOMs, but we do hit the circuit breakers frequently and also log a lot of messages like `gc overhead, spent [3.9s] collecting in the last [4s]`.
This makes sense. Based on the descriptions above, the humongous allocations are transient; if we were to run a global collection we'd free a bunch of the heap and be able to satisfy the request. The GC estimates of how much memory is actually available are out of date, causing us to reject the request too soon. When the heap is large enough, the estimation error is not that significant, but it can be dead wrong for small heap sizes.
I think we have a few choices here, with some experimentation:
@grcevski Thank you for the input so far. As this is the meta issue for the overall effort, if you intend to work on or investigate this, can you do it in the context of #88518 (possibly renaming and re-purposing it)? I have some thoughts there about G1 knobs that may be on or off the mark, but that also highlight my thinking on the status quo.
I'm out of the office but I know the aggs team has some ideas to use more efficient representations in the least efficient stages. That should help a ton with memory pressure. On small and large heaps.
If you can juggle some knobs in GC to help, that's great. But we will try to help on the other side.
Background
Recently @salvatore-campagna and others added multiple aggregations-centric workloads to our nightly benchmarks at https://elasticsearch-benchmarks.elastic.co/. One in particular has been failing frequently and was recently removed: the `aggs` challenge in the `nyc_taxis` track, defined here: https://github.com/elastic/rally-tracks/blob/master/nyc_taxis/challenges/default.json#L506. We were running this workload with two single-node configurations, one with an 8 GB heap and one with a 1 GB heap.

The purpose of the 1 GB heap was to track performance in a memory-constrained environment, in case changes to the JVM, GC settings, or object sizes over time lead to regressions. However, this configuration ended up being too unstable in its current form to run on a nightly basis, as errors during the benchmark fail the entire run, and we publish gaps in the charts.
Problem
At the point where the benchmark breaks, we spam the following search repeatedly, without the query cache:
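(The exact request body is defined in the track linked above; for illustration only, it has roughly this shape: a date histogram aggregation with the request cache disabled. The aggregation name, field, and interval below are illustrative, not copied from the track.)

```
POST /nyc_taxis/_search?request_cache=false
{
  "size": 0,
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "month"
      }
    }
  }
}
```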
With a 1 GB heap we can pretty reliably get a `circuit_breaking_exception`.
Summarized Findings
- Allocation profiling shows large `int[]`s used as the backing store for a Lucene `BKDPointTree`. We see method invocations of `DocIdSetBuilder::grow` and `DocIdSetBuilder::addBuffer` in the allocation stacktraces for these `int[]`s. (Screenshots: Memory, Allocations.)
- Setting `indices.breaker.total.limit` to `100%` allowed the benchmark to succeed. A 60 ms major GC cleaned up the humongous objects left over by prior searches and there were no `circuit_breaking_exception`s.

Items to investigate
What is left now is to evaluate each of the items below as something to discuss, something to implement, or a non-starter.