elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Downsampling bulk indexing throttling #101970

Open salvatore-campagna opened 10 months ago

salvatore-campagna commented 10 months ago

Description

The downsampling task uses a single thread to compute the metric aggregations. Indexing documents into the target index, however, happens through BulkProcessor2: the downsampling thread submits indexing requests without waiting for responses, so as to achieve maximum throughput. As a result there are normally multiple outstanding indexing requests, each consuming a thread from the search/indexing thread pool. This can end up using all available threads for downsampling (indexing), leaving no room for other tasks, such as regular indexing, to execute. Ideally we would implement a mechanism that limits the number of outstanding indexing requests, and with it the number of threads the downsampling task uses for indexing, and expose that limit as a setting users can control. One possibility would be to expose the maxBytesInFlight of BulkProcessor2 as a setting (instead of the constant it is right now).
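To make the idea concrete, here is a minimal sketch of what exposing that limit could look like. Only setMaxBytesInFlight() is an existing BulkProcessor2 knob; the setting name, default value, and wiring below are assumptions for illustration.

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.ByteSizeValue;

final class DownsampleSettings { // hypothetical holder class
    // Hypothetical setting replacing the hard-coded constant; the name
    // "downsample.max_bytes_in_flight" and the 50mb default are made up.
    static final Setting<ByteSizeValue> MAX_BYTES_IN_FLIGHT = Setting.byteSizeSetting(
        "downsample.max_bytes_in_flight",
        ByteSizeValue.ofMb(50),
        Setting.Property.NodeScope
    );
}

// Wherever downsampling builds its BulkProcessor2 (builder shape assumed),
// the setting value would feed the existing knob instead of a constant:
//
//     bulkProcessorBuilder.setMaxBytesInFlight(MAX_BYTES_IN_FLIGHT.get(settings));
```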

elasticsearchmachine commented 10 months ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

martijnvg commented 10 months ago

Maybe we should make the downsample bulk size and max bytes in flight request parameters of the downsample request? Then they could be configured from ILM or DSL.

salvatore-campagna commented 10 months ago

The only thing I am wondering about is whether changing maxBytesInFlight is a good idea... I expect users would be more interested in controlling the number of threads, and it is difficult to translate a maxBytesInFlight value into an effect on CPU usage.

martijnvg commented 10 months ago

Right, I think setMaxBytesInFlight() will just result in an EsRejectedExecutionException, and we don't handle that in DownsampleShardIndexer. The other knobs are setBulkActions() and setBulkSize(), but those just control the size of each bulk request.

If we want to control the write load from downsampling, then I think we need to change downsampling to pause reading (and rolling up documents) and resume when there is write capacity. That is a more complex change than just exposing some of the configuration options of BulkProcessor2.

DaveCTurner commented 10 months ago

I think we should expose those BulkProcessor2 parameters to the caller in any case - I'm almost certain we will want to tune them up or down at some point in the future.

I think setMaxBytesInFlight() will just result in an EsRejectedExecutionException and we don't handle it

We're using BulkProcessor2#addWithBackpressure() so hitting the bytes-in-flight limit blocks the calling thread until some other in-flight requests succeed. You only get an EsRejectedExecutionException on abort.
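For anyone skimming, here is a toy illustration of that blocking contract. This is not the BulkProcessor2 implementation, just the shape of the behaviour, with a semaphore standing in for the internal byte accounting.

```java
import java.util.concurrent.Semaphore;

// Toy model: one permit per byte in flight. The producer blocks instead
// of being rejected when the limit is reached.
final class BytesInFlightGate {
    private final Semaphore permits;

    BytesInFlightGate(int maxBytesInFlight) {
        this.permits = new Semaphore(maxBytesInFlight);
    }

    // Called by the downsampling thread before submitting a request;
    // blocks until enough earlier in-flight requests have completed.
    void beforeSubmit(int requestBytes) throws InterruptedException {
        permits.acquire(requestBytes);
    }

    // Called from the bulk response listener once a request completes.
    void afterCompletion(int requestBytes) {
        permits.release(requestBytes);
    }
}
```

So raising or lowering the bytes-in-flight limit directly trades indexing concurrency against how often the downsampling thread stalls.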

In terms of the throttling implementation, I wonder if a regular bytes-per-second SimpleRateLimiter would be good enough. My feeling is that this parameter could be at least a little meaningful to the end user: they should be able to work out roughly how much downsampled data they're producing every hour or day, so a bytes-per-second rate limit would help even out the naturally spiky workload caused by downsampling.
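A rough sketch of how that could sit in front of the bulk processor, using Lucene's RateLimiter.SimpleRateLimiter. The wrapper class, the 5 MB/sec value, and the wiring are assumptions, not actual DownsampleShardIndexer code.

```java
import java.io.IOException;

import org.apache.lucene.store.RateLimiter;
import org.elasticsearch.action.index.IndexRequest;

// Hypothetical wrapper the downsampling thread would call instead of
// handing documents straight to BulkProcessor2.
final class ThrottledDownsampleSink {
    // Bytes-per-second budget; 5 MB/sec is an arbitrary example value.
    private final RateLimiter limiter = new RateLimiter.SimpleRateLimiter(5.0);

    void index(IndexRequest request) throws IOException {
        // pause() sleeps just long enough to keep the long-run average at
        // or below the configured rate, evening out downsampling bursts.
        limiter.pause(request.source().length());
        // ... then hand the request to BulkProcessor2#addWithBackpressure ...
    }
}
```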

martijnvg commented 9 months ago

We're using BulkProcessor2#addWithBackpressure() so hitting the bytes-in-flight limit blocks the calling thread until some other in-flight requests succeed. You only get an EsRejectedExecutionException on abort.

Then my understanding was incorrect, thanks for pointing this out. Exposing the BulkProcessor2 parameters makes sense; that alone could already help with downsample spikes.

I wonder if a regular bytes-per-second SimpleRateLimiter would be good enough. My feeling is that this is a parameter that could be at least a little meaningful to the end-user, because they should be able to work out roughly how much downsampled data they're producing every hour/day, so a bytes-per-second rate limit would help even out the naturally spiky workload caused by downsampling.

I think users typically know the number of metrics that get scraped, with which dimensions, at each interval. It is difficult to reason about how many bytes each metric ends up using, given how metrics are stored in Elasticsearch (some fields are indexed and have doc values, and multiple metrics are stored per document). Therefore, I think it is also difficult to reason about downsample throughput in bytes per second. However, I do think throttling is useful. Maybe documents per second is more meaningful?
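If documents per second turned out to be the better unit, the throttle itself would stay simple. Here is a minimal sketch of a pacer the downsampling loop could call once per emitted document; the class name and wiring are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Fixed-rate pacer: each acquire() reserves one document's worth of time
// and sleeps if the caller is running ahead of the configured rate.
final class DocsPerSecondThrottle {
    private final long nanosPerDoc;
    private long nextAllowedNanos = System.nanoTime();

    DocsPerSecondThrottle(double docsPerSecond) {
        this.nanosPerDoc = (long) (1_000_000_000L / docsPerSecond);
    }

    synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        if (nextAllowedNanos > now) {
            TimeUnit.NANOSECONDS.sleep(nextAllowedNanos - now);
        }
        nextAllowedNanos = Math.max(now, nextAllowedNanos) + nanosPerDoc;
    }
}
```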

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)