elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Re-sorting of time series aggregation buckets in TimeSeriesAggregator #103571

Open salvatore-campagna opened 9 months ago

salvatore-campagna commented 9 months ago

Elasticsearch Version

8.13 and above

Installed Plugins

No response

Java Version

bundled

OS Version

All

Problem Description

The TimeSeriesAggregator used to process data in (_tsid, @timestamp) order. Since the introduction of _tsid hashing, it processes data in (_tsid hash, @timestamp) order instead. Because of this, we need to re-sort data in TimeSeriesAggregator#buildAggregations. This is required because in the collect method we assume that a bucket is exhausted when the _tsid hash changes, while sorting on _tsid and sorting on the _tsid hash can produce different orders. This is not ideal performance-wise, because the in-memory sort slows down time series aggregations.
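To illustrate why the two orders can diverge, here is a minimal standalone sketch. It is not Elasticsearch code: `TsidOrderSketch`, its methods, and the use of SHA-256 as the hash are assumptions made for the example, and the `host=...` strings are stand-ins for real _tsid values. It sorts the same keys once by hash (the order in which buckets would be collected) and once by raw value (the order the re-sort restores).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TsidOrderSketch {

    // Hypothetical stand-in for the _tsid hash: SHA-256 over the raw _tsid string.
    static byte[] hash(String tsid) {
        try {
            return MessageDigest.getInstance("SHA-256")
                .digest(tsid.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Order in which buckets would be seen when data is sorted by hash.
    static List<String> sortedByHash(List<String> tsids) {
        List<String> copy = new ArrayList<>(tsids);
        copy.sort((a, b) -> Arrays.compareUnsigned(hash(a), hash(b)));
        return copy;
    }

    // Order by raw _tsid, i.e. what an in-memory re-sort produces.
    static List<String> sortedByTsid(List<String> tsids) {
        List<String> copy = new ArrayList<>(tsids);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<String> tsids = List.of("host=a", "host=b", "host=c", "host=d");
        List<String> collected = sortedByHash(tsids);
        List<String> resorted = sortedByTsid(collected);
        System.out.println("collected (hash order):  " + collected);
        System.out.println("re-sorted (_tsid order): " + resorted);
    }
}
```

The hash order depends entirely on the hash function, so nothing guarantees that it matches the lexicographic order of the raw keys; the extra sort at the end is the cost this issue is about.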

Ideally we would like to avoid re-sorting in the aggregator. Maybe we can move the doc values access that we now have in TimeSeriesAggregator#getLeafCollector up into TimeSeriesAggregator#buildAggregations, avoiding the re-sorting of buckets. That, however, requires us to keep track of ordinals and the segment they belong to, so that we can read doc values correctly and fill the aggregation result with the correct dimension values.
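The bookkeeping this would need could look roughly like the sketch below. Everything in it is hypothetical: `DeferredTsidLookupSketch`, `DocRef`, `newBucket` and `resolve` are illustrative names, and each segment's doc values are simulated with a plain `Map<Integer, String>` from doc ID to _tsid rather than the real Lucene doc values API. It only shows the idea of remembering, per bucket ordinal, which segment and document to read from later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class DeferredTsidLookupSketch {

    // Hypothetical pointer to the first document seen for a bucket.
    static final class DocRef {
        final int segment;
        final int docId;

        DocRef(int segment, int docId) {
            this.segment = segment;
            this.docId = docId;
        }
    }

    // Index in this list is the bucket ordinal.
    private final List<DocRef> refsByBucketOrd = new ArrayList<>();

    // Collection phase: record where the bucket's data lives, read nothing yet.
    int newBucket(int segment, int docId) {
        refsByBucketOrd.add(new DocRef(segment, docId));
        return refsByBucketOrd.size() - 1;
    }

    // Build phase: resolve each bucket's _tsid from the right segment.
    // `segments` simulates per-segment doc values (doc ID -> _tsid).
    List<String> resolve(List<Map<Integer, String>> segments) {
        List<String> tsids = new ArrayList<>(refsByBucketOrd.size());
        for (DocRef ref : refsByBucketOrd) {
            tsids.add(segments.get(ref.segment).get(ref.docId));
        }
        return tsids;
    }

    public static void main(String[] args) {
        List<Map<Integer, String>> segments = List.of(
            Map.of(0, "host=a", 3, "host=b"),
            Map.of(0, "host=a", 1, "host=c")
        );
        DeferredTsidLookupSketch sketch = new DeferredTsidLookupSketch();
        sketch.newBucket(0, 3); // bucket ordinal 0
        sketch.newBucket(1, 1); // bucket ordinal 1
        System.out.println(sketch.resolve(segments)); // prints [host=b, host=c]
    }
}
```

Doc IDs are only meaningful within their own segment, which is why the segment has to be stored alongside each one.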

Steps to Reproduce

Just run a time series aggregation on a time series index.

Logs (if relevant)

No response

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-analytics-geo (Team:Analytics)