Background

Recently @salvatore-campagna and others added multiple aggregations-centric workloads to our nightly benchmarks at https://elasticsearch-benchmarks.elastic.co/. One in particular has been failing frequently and was recently removed: the aggs challenge in the nyc_taxis track, defined here: https://github.com/elastic/rally-tracks/blob/master/nyc_taxis/challenges/default.json#L506. We were running this workload with two single-node configurations, one with an 8GB heap and one with a 1GB heap.

The purpose of the 1GB heap was to track performance in a memory-constrained environment, in case changes to the JVM, GC settings, or object sizes over time lead to regressions. However, this configuration has proven too unstable in its current form to run on a nightly basis: errors during the benchmark fail the entire run, and we publish gaps in the charts.
Problem
At the point where the benchmark breaks, we spam the following search repeatedly, without the query cache:
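(The exact request body isn't reproduced here. Purely as an illustration of the general shape of these searches, the aggregation operations in the aggs challenge look something like the hypothetical request below, combining a range filter with a date_histogram and a sub-aggregation over nyc_taxis fields; this is not the actual operation from the track.)

    POST /nyc_taxis/_search
    {
      "size": 0,
      "query": {
        "range": {
          "dropoff_datetime": {
            "gte": "2015-01-01 00:00:00",
            "lt": "2016-01-01 00:00:00"
          }
        }
      },
      "aggs": {
        "dropoffs_over_time": {
          "date_histogram": {
            "field": "dropoff_datetime",
            "calendar_interval": "day"
          },
          "aggs": {
            "avg_total_amount": {
              "avg": { "field": "total_amount" }
            }
          }
        }
      }
    }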
With a 1GB heap we can pretty reliably get an error like:
{'error': {'root_cause': [],
           'type': 'search_phase_execution_exception',
           'reason': '',
           'phase': 'fetch',
           'grouped': True,
           'failed_shards': [],
           'caused_by': {'type': 'circuit_breaking_exception',
                         'reason': '[parent] Data too large, data for [<reduce_aggs>] would be [1035903210/987.9mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1035902976/987.9mb], new bytes reserved: [234/234b], usages [eql_sequence=0/0b, model_inference=0/0b, inflight_requests=484/484b, request=234/234b, fielddata=0/0b]',
                         'bytes_wanted': 1035903210,
                         'bytes_limit': 1020054732,
                         'durability': 'TRANSIENT'}},
 'status': 429}
Summarized Findings
We see 800 MB of humongous allocations at the peak during the benchmark, just before it fails with the circuit_breaking_exception. These are evidently large int[]s used as the backing store for a Lucene BKDPointTree; the allocation stack traces for these int[]s include invocations of DocIdSetBuilder::grow and DocIdSetBuilder::addBuffer.
[Figures: Memory and Allocations profiles from the benchmark run]
Setting indices.breaker.total.limit to 100% allowed the benchmark to succeed: a 60 ms major GC cleaned up the humongous objects left over from prior searches, and there were no circuit_breaking_exceptions.
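For reference, a minimal sketch of applying that limit through the cluster settings API (indices.breaker.total.limit is a dynamic setting, so it can be changed without a restart; the 100% value mirrors the experiment above):

    PUT _cluster/settings
    {
      "persistent": {
        "indices.breaker.total.limit": "100%"
      }
    }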
Items to investigate
Rally-tracks changes
Setting a target-throughput on the tasks in question could give the background activity in G1 enough time to keep up with collectable regions (see also the suggestion above about reducing G1MixedGCCountTarget for small heaps). We have been moving away from this configuration, as target-throughputs can obscure some performance characteristics of operations and are difficult to tune for broad usage. We could also consider adding support for a fixed think-time parameter to allow some cool-down between invocations. A sketch of such a track change follows.
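A rough sketch of what throttling one of these tasks could look like in the challenge JSON (the operation name and the values are illustrative, not taken from the actual track; a fixed think-time parameter would be a new Rally feature and is therefore not shown):

    {
      "operation": "date_histogram_agg",
      "clients": 1,
      "warmup-iterations": 50,
      "iterations": 100,
      "target-throughput": 2
    }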
I'm personally not a fan of this approach, as one would expect a single thread running the same operation over and over to not overwhelm a system like Elasticsearch on its own.
In the course of the aggs challenge, we could add settings modifications that toggle the circuit breaker limit during the run, measuring latency while the breaker limit is 100% and measuring failures while it is at the default. A rough sketch of the settings toggle is below.
Currently we run benchmarks with --on-error abort. For the lower (default) limit value we would need to chart errors rather than fail the run.
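A rough sketch of how the breaker toggle could be expressed in the track, assuming Rally's put-settings operation is used to flip the limit between the aggregation tasks (the operation name and placement are assumptions, not the current track definition):

    {
      "name": "relax-breaker-limit",
      "operation-type": "put-settings",
      "body": {
        "persistent": {
          "indices.breaker.total.limit": "100%"
        }
      }
    }

A companion operation that sets the value back to null would restore the default limit before the failure-measurement portion of the run.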