elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Address memory pressure for intensive aggregations and small heaps #86531

Open DJRickyB opened 2 years ago

DJRickyB commented 2 years ago

Background

Recently @salvatore-campagna and others added multiple aggregations-centric workloads to our nightly benchmarks at https://elasticsearch-benchmarks.elastic.co/. One in particular has been failing frequently and was recently removed: the aggs challenge in the nyc_taxis track, defined here: https://github.com/elastic/rally-tracks/blob/master/nyc_taxis/challenges/default.json#L506. We were running this workload with two single-node configurations, one with an 8GB heap and one with a 1GB heap.

The purpose of the 1GB heap was to track performance in a memory-constrained environment, in case changes to the JVM, GC settings, or object sizes over time lead to regressions. However, this configuration ended up being too unstable in its current form to run on a nightly basis: errors during the benchmark fail the entire run, and the resulting gaps show up in the published charts.

Problem

At the point where the benchmark breaks, we spam the following search repeatedly, without the query cache:

{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2015-03-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "week",
        "time_zone": "America/New_York"
      }
    }
  }
}

With a 1GB heap we can pretty reliably get an error like

{'error': {'root_cause': [], 'type': 'search_phase_execution_exception', 'reason': '', 'phase': 'fetch', 'grouped': True, 'failed_shards': [], 'caused_by': {'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<reduce_aggs>] would be [1035903210/987.9mb], which is larger than the limit of [1020054732/972.7mb],
real usage: [1035902976/987.9mb], new bytes reserved: [234/234b], usages [eql_sequence=0/0b, model_inference=0/0b, inflight_requests=484/484b, request=234/234b, fielddata=0/0b]', 'bytes_wanted': 1035903210, 'bytes_limit': 1020054732, 'durability': 'TRANSIENT'}}, 'status': 429}

Summarized Findings

  1. We see 800 MB of humongous allocations at the peak during the benchmark, prior to failing with the circuit_breaking_exception. These are evidently large int[]s used as the backing store for a Lucene BKDPointTree; the allocation stack traces for these int[]s show invocations of DocIdSetBuilder::grow and DocIdSetBuilder::addBuffer. (Memory and allocation profile screenshots are attached to the original issue.)
  2. Setting the indices.breaker.total.limit to 100% allowed the benchmark to succeed. A 60 ms major GC cleaned up the humongous objects left over by prior searches and there were no circuit_breaking_exceptions.
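
For reference, the workaround in item 2 can be applied at runtime through the cluster settings API. A minimal sketch (the persistent scope is just one option, transient works as well; 100% effectively hands the entire heap to the parent breaker):

PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "100%"
  }
}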

Items to investigate

What is left now is to evaluate each of the below items as something to discuss, something to implement, or a non-starter.

salvatore-campagna commented 2 years ago

profiles.zip

The profiles.zip archive includes three JFR recordings (each recording is a different Rally race).

elasticmachine commented 2 years ago

Pinging @elastic/es-perf (Team:Performance)

elasticmachine commented 2 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

grcevski commented 2 years ago

Humongous objects are particularly hard for G1GC. It typically has a reserved area for them, but given a large volume of humongous allocations (and a small heap) it will easily overflow this area and has no choice but to allocate in the old space. The old space is what we look at when figuring out how much free memory we have, so having all these (perhaps now dead) objects in the old space appears to us as if we cannot satisfy the request. G1 will fight fiercely not to do a global collect and clean up the old space if it can satisfy allocation requests by simply doing young-space collections. So this situation, with the old space full of dead objects, might persist for a while.
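
For context (not stated in the thread): G1 treats any allocation larger than half a region as humongous, and with a 1GB heap the default region size comes out to roughly 1MB, so the large int[] buffers above all take the humongous path. A minimal jvm.options sketch of one knob that could be experimented with here; the value is purely illustrative, not a recommendation from this thread:

# Larger regions raise the humongous threshold (objects > half a region),
# at the cost of fewer, coarser regions on a small heap.
-XX:G1HeapRegionSize=4m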

With a low core count and heap size, OpenJDK will typically choose SerialGC by default instead of G1:

https://github.com/openjdk/jdk/blob/ad54d8dd832b22485d7ac45958cc4c9bfd70fbd2/src/hotspot/share/gc/shared/gcConfig.cpp#L98

and

https://github.com/openjdk/jdk/blob/ad54d8dd832b22485d7ac45958cc4c9bfd70fbd2/src/hotspot/share/runtime/os.cpp#L1677

I'm not sure what the answer is here. I don't think Serial would be a good choice for us, but perhaps experimenting with ShenandoahGC for low heap sizes is worth a shot; it's the closest collector to the old CMS. Performance will definitely be worse compared to G1, but we may not hit the circuit breakers as often.
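
If we wanted to try that, the experiment itself is a one-line jvm.options override for the small-heap benchmark configuration, assuming the JDK build in use ships Shenandoah (as noted below, it is no longer experimental in JDK 17):

# Replace G1 with Shenandoah for the 1GB-heap benchmark node only.
-XX:+UseShenandoahGC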

Another option is to call System.gc() when we are about to fail with the circuit breaker. That will definitely nudge G1 to perform a collection of the old space. With low heap sizes, calling System.gc() will not be deadly.
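
A hypothetical sketch of what "nudge G1 before tripping the breaker" could look like; this is not the actual Elasticsearch breaker code, and the class and method names are made up for illustration:

// Hypothetical sketch only: try one forced collection before rejecting,
// since the reported heap usage may be dominated by dead humongous objects.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

class GcNudgingParentBreaker {
    private final long limitBytes;
    private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

    GcNudgingParentBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Returns true if newBytes fits under the limit, possibly after one forced GC. */
    boolean tryReserve(long newBytes) {
        if (usedHeap() + newBytes <= limitBytes) {
            return true;
        }
        System.gc(); // nudge the collector to clean the old space before giving up
        return usedHeap() + newBytes <= limitBytes;
    }

    private long usedHeap() {
        return memory.getHeapMemoryUsage().getUsed();
    }
}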

jaymode commented 2 years ago

I don't think Serial would be a good choice for us

Can you elaborate on this?

perhaps experimenting with ShenandoahGC for low heap sizes is worth a shot. It's the closest collector to old CMS. Performance will definitely be worse compared to G1, but we may not hit the circuit breakers as often.

Do we think Shenandoah is ready for prime time? Also, these low-heap instances have smaller amounts of CPU given to them, so picking a GC that steals time from application threads might not be great. It is worth noting that on these instances with small heaps running G1GC we don't have a lot of OOMs, but we do hit the circuit breakers frequently and also log a lot of messages like gc overhead, spent [3.9s] collecting in the last [4s].

grcevski commented 2 years ago

I don't think Serial would be a good choice for us

Can you elaborate on this?

The problem is that we try to estimate whether we'll go out of memory based on stale object-liveness statistics. In a sense, the old-area liveness information reported by the GC is always wrong (or out of date): it's only recalculated on full GCs, but until we do one we can't tell. SerialGC doesn't do concurrent mark, which is our only hope of finding out what's actually live in the old area before we actually get a GC to run. G1GC does concurrent mark, but it needs to be certain it needs it before it will trigger it. We can experiment with G1ReservePercent and InitiatingHeapOccupancyPercent to let the G1 concurrent mark run early and have time to finish for these small heap sizes.
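
Concretely, that experiment is also just a pair of jvm.options overrides. The numbers below are placeholders meant to illustrate the direction (more reserved headroom, earlier concurrent mark so liveness data is fresher when the breaker checks real usage), not tuned values:

-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30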

ShenandoahGC is now included by default in JDK 17 and not experimental, so it's ready for us to use. Shenandoah might work because of its pacer which forces it to do a lot of preemptive GC work, marking and cleaning up a lot more often. The pacer taxes the application for allocating a lot, stealing application time to do GC work. I think with Shenandoah our heap occupancy estimates will always be much closer to reality, however the collector overhead will be larger. Having more GC overhead may not be an issue, as long as we can process every transaction, rather than rejecting some.

It is worth noting that on these instances with small heaps running G1GC that we don't have a lot of OOMs, but do hit the circuit breakers frequently and also log a lot messages like gc overhead, spent [3.9s] collecting in the last [4s].

This makes sense. Based on the descriptions above, the humongous allocations are transient; if we were to run a global collect we'd free a bunch of the heap and be able to satisfy the request. The GC's estimates of how much memory is actually available are out of date, causing us to reject the request too soon. When the heap is large enough the estimation error is not that significant, but it can be dead wrong for small heap sizes.

I think we have a few choices here with some experimentation:

DJRickyB commented 2 years ago

@grcevski Thank you for the input so far. As this is the meta issue for the overall effort, if you intend to work/investigate, can you do it in the context of #88518 (possibly renaming and re-purposing it)? I have some thoughts there about G1 knobs that may be on or off the mark, but they also highlight my thinking on the status quo.

nik9000 commented 2 years ago

I'm out of the office, but I know the aggs team has some ideas to use more efficient representations in the least efficient stages. That should help a ton with memory pressure, on both small and large heaps.

If you can juggle some GC knobs to help, that's great. But we will try and help on the other side.
