The purpose of the 1GB heap was to track performance in a memory-constrained environment, so that changes to the JVM, GC settings, or object sizes that cause regressions over time would be caught. However, this configuration proved too unstable in its current form to run nightly: any error during the benchmark fails the entire run, and we publish gaps in the charts.
Problem
At the point where the benchmark breaks, we issue the following search repeatedly, with the query cache disabled:
With a 1GB heap we can reliably reproduce an error like:
{'error': {'root_cause': [], 'type': 'search_phase_execution_exception', 'reason': '', 'phase': 'fetch', 'grouped': True, 'failed_shards': [], 'caused_by': {'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<reduce_aggs>] would be [1035903210/987.9mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1035902976/987.9mb], new bytes reserved: [234/234b], usages [eql_sequence=0/0b, model_inference=0/0b, inflight_requests=484/484b, request=234/234b, fielddata=0/0b]', 'bytes_wanted': 1035903210, 'bytes_limit': 1020054732, 'durability': 'TRANSIENT'}}, 'status': 429}
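The numbers in the error are internally consistent with the parent breaker's accounting. A minimal sketch (not Elasticsearch's actual implementation) of that check, using the values from the error above and assuming the default parent limit of 95% of a 1GB heap:

```java
// Illustrative sketch of the parent circuit breaker check: a request trips
// the breaker when tracked real heap usage plus the newly reserved bytes
// would exceed the limit (by default 95% of the heap when real-memory
// tracking is enabled). Not Elasticsearch's actual code.
public class BreakerMath {
    static final long HEAP_BYTES = 1L << 30;               // -Xmx1g
    static final long LIMIT = (long) (HEAP_BYTES * 0.95);  // 1020054732 bytes = 972.7mb

    static boolean wouldTrip(long realUsage, long newBytes) {
        return realUsage + newBytes > LIMIT;
    }

    public static void main(String[] args) {
        long realUsage = 1_035_902_976L; // "real usage" from the error above
        long newBytes  = 234L;           // "new bytes reserved"
        System.out.println(LIMIT);                          // 1020054732
        System.out.println(wouldTrip(realUsage, newBytes)); // true
    }
}
```

Note that almost the entire reservation is pre-existing "real usage"; the 234 bytes requested by `<reduce_aggs>` merely pushed an already-full heap over the line.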
Summarized Findings
We see roughly 800 MB of humongous allocations at the peak during the benchmark, just before it fails with the circuit_breaking_exception. These are evidently large int[]s used as the backing store for a Lucene BKDPointTree: the allocation stack traces for these int[]s show invocations of DocIdSetBuilder::grow and DocIdSetBuilder::addBuffer.
[Figure: Memory] [Figure: Allocations]
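To see why these int[]s end up humongous on this node in particular, here is some back-of-the-envelope arithmetic (assuming G1 defaults; these values are not read from the JVM): G1 targets about 2048 regions, with the region size a power of two clamped to [1 MB, 32 MB], and any object spanning at least half a region is allocated as humongous.

```java
// Approximate G1 region sizing (assumed defaults, not queried from the JVM):
// region size ~= heap / 2048, rounded to a power of two in [1 MB, 32 MB].
// An object of >= half a region is "humongous".
public class HumongousMath {
    static long regionSize(long heapBytes) {
        long target = heapBytes / 2048;
        long size = 1L << 20;                                 // 1 MB minimum
        while (size * 2 <= target && size < (32L << 20)) size <<= 1;
        return size;
    }

    public static void main(String[] args) {
        long region = regionSize(1L << 30);    // 1 MB region for a 1 GB heap
        long threshold = region / 2;           // 524288 bytes
        // An int[] stores 4 bytes per element (ignoring the object header),
        // so roughly this many elements make the array humongous:
        long ints = threshold / Integer.BYTES; // 131072
        System.out.println(region + " " + threshold + " " + ints);
    }
}
```

So with a 1 GB heap, any int[] past roughly 131k elements is humongous, which a DocIdSetBuilder buffer easily exceeds on this workload; the 8 GB node, with 4 MB regions, tolerates buffers four times larger before they qualify.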
Setting indices.breaker.total.limit to 100% allowed the benchmark to succeed: a 60 ms major GC cleaned up the humongous objects left over by prior searches, and there were no circuit_breaking_exceptions.
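For reference, one way to apply that override for an experiment is via the cluster settings API (a transient setting, so it reverts on restart; running at 100% is for benchmarking only, not a recommendation):

```shell
# Raise the parent breaker limit to the full heap for this experiment.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.breaker.total.limit": "100%"}}'
```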
Items to investigate
Aggregation/Search/Lucene changes
Is the bitset in BKDReader usefully cacheable? In other words, do its structure and contents depend on the query, or is it a generic representation of the docs in the index?
If it is reusable as outlined, we should consider caching this object. Humongous objects should be avoided to the degree we can: because they occupy contiguous regions in the old generation, they can hurt garbage-collector efficiency and add concerns around major collections, fragmentation, and wasted memory.
Is the bitset in BKDReader optimally allocated? The presence of a grow() method in the stack trace in the Summarized Findings above suggests an improperly or generically sized structure, and worse, one that may need to copy on growth (in order to remain a single array) and that meets the G1 definition of humongous. Even if we cannot cache it, avoiding repeated allocations while populating the bitset should be a win for both memory and compute.
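To make the last point concrete, here is a simplified sketch (not Lucene's actual DocIdSetBuilder strategy) of why copy-on-grow is expensive: each doubling allocates a fresh array and copies the old one, leaving a trail of intermediate garbage that a pre-sized allocation from a known bound (e.g. the reader's maxDoc) would avoid entirely.

```java
import java.util.Arrays;

// Illustrative only: growing a backing int[] by doubling repeatedly
// allocates and copies, and every intermediate array becomes garbage.
public class Presize {
    static int copies; // how many intermediate arrays were allocated

    static int[] addAllGrowing(int n) {
        int[] buf = new int[16];
        int len = 0;
        for (int doc = 0; doc < n; doc++) {
            if (len == buf.length) {
                buf = Arrays.copyOf(buf, buf.length * 2); // copy-on-grow
                copies++;
            }
            buf[len++] = doc;
        }
        return buf;
    }

    public static void main(String[] args) {
        addAllGrowing(1 << 20);     // ~1M docs
        System.out.println(copies); // 16 doublings from 16 up to 1048576
    }
}
```

A pre-sized `new int[n]` would do one allocation and zero copies; on a heap where the final array is already humongous, also materializing all of its smaller predecessors only compounds the pressure.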
Background
Recently @salvatore-campagna and others added multiple aggregations-centric workloads to our nightly benchmarks at https://elasticsearch-benchmarks.elastic.co/. One in particular has been failing frequently and was recently removed: the aggs challenge in the nyc_taxis track, defined here: https://github.com/elastic/rally-tracks/blob/master/nyc_taxis/challenges/default.json#L506. We were running this workload with two single-node configurations, one with an 8GB heap and one with a 1GB heap.