Open salvatore-campagna opened 10 months ago
Pinging @elastic/es-analytics-geo (Team:Analytics)
Maybe we should make the downsample bulk size and max bytes in flight parameters of the downsample request? Then they can be configured from ILM or DSL.
The only thing I am wondering about is whether changing maxBytesInFlight is a good idea. I expect users would be more interested in controlling the number of threads. I think it is a bit difficult to tune maxBytesInFlight if one really wants to act on CPU usage.
Right, I think setMaxBytesInFlight() will just result in an EsRejectedExecutionException, and we don't handle it in DownsampleShardIndexer. The other knobs are setBulkActions() and setBulkSize(), but those just control the size of the bulk request.
If we want to control the write load from downsampling, then I think we need to change downsampling so that it pauses the reading (and rolling up of documents) and resumes when there is write capacity? This is a more complex change than just exposing some of the configuration options of BulkProcessor2.
I think we should expose those BulkProcessor2 parameters to the caller in any case - I'm almost certain we will want to tune them up or down at some point in the future.
> I think setMaxBytesInFlight() will just result in an EsRejectedExecutionException and we don't handle it
We're using BulkProcessor2#addWithBackpressure(), so hitting the bytes-in-flight limit blocks the calling thread until some other in-flight requests succeed. You only get an EsRejectedExecutionException on abort.
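To make the blocking behaviour described above concrete, here is a minimal sketch of a bytes-in-flight gate. It is not the BulkProcessor2 implementation: the class BytesInFlightLimiter and all of its methods are hypothetical names invented for illustration, and the rule for admitting an oversized request when nothing else is in flight is an assumption of this sketch.

```java
// Hypothetical illustration of bytes-in-flight backpressure.
// This is NOT the BulkProcessor2 code; names and policy are assumptions.
final class BytesInFlightLimiter {
    private final long maxBytesInFlight;
    private long bytesInFlight = 0;

    BytesInFlightLimiter(long maxBytesInFlight) {
        if (maxBytesInFlight <= 0) {
            throw new IllegalArgumentException("maxBytesInFlight must be positive");
        }
        this.maxBytesInFlight = maxBytesInFlight;
    }

    // True if admitting `bytes` now would exceed the limit. A request larger
    // than the whole limit is still admitted when nothing else is in flight,
    // so that it cannot block forever (an assumption of this sketch).
    private boolean wouldExceed(long bytes) {
        return bytesInFlight > 0 && bytesInFlight + bytes > maxBytesInFlight;
    }

    // Blocks the calling thread until the request fits under the limit.
    synchronized void acquire(long bytes) throws InterruptedException {
        while (wouldExceed(bytes)) {
            wait(); // woken by release()
        }
        bytesInFlight += bytes;
    }

    // Non-blocking variant: returns false instead of waiting.
    synchronized boolean tryAcquire(long bytes) {
        if (wouldExceed(bytes)) {
            return false;
        }
        bytesInFlight += bytes;
        return true;
    }

    // Called when an in-flight request completes, successfully or not.
    synchronized void release(long bytes) {
        bytesInFlight -= bytes;
        notifyAll(); // let blocked producers re-check the limit
    }

    synchronized long bytesInFlight() {
        return bytesInFlight;
    }
}
```

In this sketch the producer (the downsampling thread, in the terms of this issue) would call acquire() before submitting a bulk request and the response listener would call release(), so a lower limit slows the reader down instead of rejecting work.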
In terms of the throttling implementation, I wonder if a regular bytes-per-second SimpleRateLimiter would be good enough. My feeling is that this is a parameter that could be at least a little meaningful to the end-user, because they should be able to work out roughly how much downsampled data they're producing every hour/day, so a bytes-per-second rate limit would help even out the naturally spiky workload caused by downsampling.
> We're using BulkProcessor2#addWithBackpressure() so hitting the bytes-in-flight limit blocks the calling thread until some other in-flight requests succeed. You only get an EsRejectedExecutionException on abort.
Then my understanding was incorrect. Thanks for mentioning this. Exposing the BulkProcessor2 parameters makes sense. This could then already help with downsample spikes.
> I wonder if a regular bytes-per-second SimpleRateLimiter would be good enough. My feeling is that this is a parameter that could be at least a little meaningful to the end-user, because they should be able to work out roughly how much downsampled data they're producing every hour/day, so a bytes-per-second rate limit would help even out the naturally spiky workload caused by downsampling.
I think users typically know the number of metrics that get scraped, and with what dimensions, at each interval. It is difficult to reason about how many bytes each metric ends up using because of how metrics are stored in Elasticsearch (some fields are indexed and have doc values, and the number of metrics stored per document varies). Therefore, I think it is also difficult to reason about downsample throughput in bytes per second. However, I do think throttling is useful. Maybe documents per second is more meaningful?
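Whichever unit is chosen, the throttling idea itself is small. The following is a rough sketch of a units-per-second limiter, where a unit can be a byte or a document. It does not use Lucene's SimpleRateLimiter; the class UnitsPerSecondLimiter and its methods are hypothetical names for illustration, and the choice not to credit idle time as burst capacity is an assumption of this sketch.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a simple rate limiter. A "unit" may be a byte
// or a document; the caller decides. Not taken from Elasticsearch or Lucene.
final class UnitsPerSecondLimiter {
    private final double unitsPerSecond;
    private long nextFreeNanos; // earliest time the next batch may start

    UnitsPerSecondLimiter(double unitsPerSecond, long startNanos) {
        if (unitsPerSecond <= 0) {
            throw new IllegalArgumentException("unitsPerSecond must be positive");
        }
        this.unitsPerSecond = unitsPerSecond;
        this.nextFreeNanos = startNanos;
    }

    // Reserves `units` and returns how many nanoseconds the caller should
    // wait before sending them. Takes the current time as a parameter so the
    // arithmetic can be checked without sleeping.
    synchronized long reserve(long units, long nowNanos) {
        // Compare via subtraction, as recommended for System.nanoTime() values.
        long waitNanos = Math.max(0L, nextFreeNanos - nowNanos);
        long costNanos = (long) (units / unitsPerSecond * 1_000_000_000L);
        // Idle time is not banked: after a quiet period the budget restarts
        // from "now" rather than allowing a burst (assumption of this sketch).
        nextFreeNanos = nowNanos + waitNanos + costNanos;
        return waitNanos;
    }

    // Blocking convenience wrapper around reserve().
    void acquire(long units) throws InterruptedException {
        long waitNanos = reserve(units, System.nanoTime());
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }
}
```

A producer would call acquire(batchSizeInUnits) before each bulk request. Switching between bytes per second and documents per second then only changes what the caller counts, not the limiter.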
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Description
The downsampling task is single-threaded when it comes to metric aggregations. However, indexing documents into the target index happens through BulkProcessor2. The downsampling thread submits indexing requests without waiting for a response, so as to achieve maximum throughput. As a result, there are normally multiple outstanding indexing requests consuming threads from the search/indexing thread pool. That can result in downsampling (indexing) using all available threads, leaving no room for other tasks, such as regular indexing. Ideally, we would like to implement a mechanism that limits the number of outstanding indexing requests, and therefore the number of threads used for indexing by the downsampling thread. We would also like to expose this limit as a setting that users can control. One possibility would be to expose the maxBytesInFlight of BulkProcessor2 as a setting (instead of a constant, as it is right now).