elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add time series support to compute engine #105397

Open martijnvg opened 6 months ago

martijnvg commented 6 months ago

This is the meta issue that tracks the work needed in the compute engine to power time series support. For now, at least, this doesn't include the language changes to ES|QL. The compute engine components should only be activated by enabling specific query pragmas, until the time series compute engine components are more stable and the ES|QL language is ready to adopt them.

General overview

[image: an overview of how time series aggregation can work in the compute engine, assuming no time series crosses a backing index boundary]

The idea is that a new source operator will emit all matching documents in time series order (_tsid ascending, @timestamp descending). Documents are sorted in that order at the segment level, but not at the shard level. Each page will additionally include tsid and timestamp blocks. Documents of the same time series should be contained in the block. A new time series grouping operator will make use of the sorted nature of the pages that the source operator emits and group by tsid, or by tsid and timestamp interval. The output of this operator can be used by other operators such as the HashAggregationOperator.
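To make the ordering argument concrete, here is a minimal Python sketch (not the compute engine's Java code). It assumes a hypothetical document shape of `(tsid, timestamp, value)` tuples and hypothetical helper names `merge_segments` and `group_sorted`. It shows how segments that are each sorted by tsid ascending and timestamp descending could be merged into one shard-level stream, and how a grouping step could then form groups in a single pass because equal keys arrive next to each other.

```python
import heapq
from itertools import groupby

# Hypothetical document shape: (tsid, timestamp, value). Each segment is
# already sorted by tsid ascending, timestamp descending.
def merge_segments(segments):
    """K-way merge of per-segment sorted docs into one shard-level stream."""
    # Negating the timestamp turns "timestamp descending" into an ascending key.
    return heapq.merge(*segments, key=lambda doc: (doc[0], -doc[1]))

def group_sorted(docs, interval):
    """Streaming grouping by (tsid, time bucket); relies on sorted input."""
    def key(doc):
        return (doc[0], doc[1] - doc[1] % interval)
    for (tsid, bucket), run in groupby(docs, key=key):
        yield tsid, bucket, [doc[2] for doc in run]

segment_a = [("host-a", 120, 1.0), ("host-a", 60, 2.0), ("host-b", 90, 5.0)]
segment_b = [("host-a", 90, 3.0), ("host-b", 150, 4.0), ("host-b", 30, 6.0)]

merged = list(merge_segments([segment_a, segment_b]))
groups = list(group_sorted(merged, interval=60))
```

Because timestamps only decrease within one tsid, the time bucket never increases within a series, so consecutive runs are enough and no hash table is needed for this step in the sketch.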

Sometimes not all samples of a time series are in the same shard. This can happen when a query targets multiple backing indices of a TSDB data stream. In this case we need to postpone grouping for the affected time series in the new time series grouping operator: it needs to group these time series on the coordinating node (when the aggregation mode is final in AggregateExec). Initially we will build a time series grouping operator that assumes time series are always scattered across multiple backing indices and thus performs the grouping when the aggregation mode is final. In follow-ups, we can then improve the operator to detect when time series don't cross backing index boundaries. In that case the grouping can be performed locally, when the aggregation mode is partial.
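The following Python sketch illustrates, under assumptions, why a result computed within a single backing index can be incomplete and has to be finished on the coordinating node. It is a generic two-phase (partial, then final) average, not the actual operator design. The `(tsid, timestamp, value)` shape and the function names `partial_aggregate` and `final_aggregate` are hypothetical.

```python
from collections import defaultdict

def partial_aggregate(samples, interval):
    """Per-index step: keep intermediate [sum, count] state per (tsid, bucket)."""
    state = defaultdict(lambda: [0.0, 0])
    for tsid, timestamp, value in samples:
        slot = state[(tsid, timestamp - timestamp % interval)]
        slot[0] += value
        slot[1] += 1
    return state

def final_aggregate(partials):
    """Coordinator step: merge the partial states, then compute the average."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}

# Two backing indices hold samples of the same series "host-a".
index_old = [("host-a", 10, 2.0), ("host-a", 40, 4.0)]
index_new = [("host-a", 50, 6.0), ("host-a", 70, 8.0)]

averages = final_aggregate(
    [partial_aggregate(index_old, 60), partial_aggregate(index_new, 60)]
)
```

In this toy data the bucket starting at 0 receives samples from both indices, so neither index alone can produce its average. Only the merged state on the coordinator can.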

Initially we will only allow filtering on dimension fields. More specifically, this restriction applies to the filters that get pushed down to the time series source operator. If filters on labels or metrics get pushed down to the source operator, we run the risk of breaking the ordered samples of a time series apart.
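A small Python sketch of the difference, assuming that dimension values are constant within one time series (the tsid is derived from them). The sample layout and the helper `timestamps_by_series` are hypothetical. A dimension filter keeps or drops a series as a whole, while a metric filter can remove individual samples from the middle of a series.

```python
# Hypothetical samples: (tsid, dimensions, timestamp, metric value),
# sorted by tsid ascending, timestamp descending.
samples = [
    ("ts1", {"host": "a"}, 30, 0.9),
    ("ts1", {"host": "a"}, 20, 0.2),
    ("ts1", {"host": "a"}, 10, 0.8),
    ("ts2", {"host": "b"}, 30, 0.7),
    ("ts2", {"host": "b"}, 20, 0.6),
]

def timestamps_by_series(rows):
    """Collect the surviving timestamps per tsid, preserving input order."""
    out = {}
    for tsid, _dims, timestamp, _value in rows:
        out.setdefault(tsid, []).append(timestamp)
    return out

# Dimension filter: every sample of ts1 survives, ts2 disappears entirely.
by_dimension = timestamps_by_series(r for r in samples if r[1]["host"] == "a")

# Metric filter: ts1 loses the sample at timestamp 20, leaving a hole.
by_metric = timestamps_by_series(r for r in samples if r[3] > 0.5)
```

Any computation that depends on consecutive samples of a series would see the gap in the metric-filtered result, which is one way to read the risk described above.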

Tasks

Optional:

elasticsearchmachine commented 6 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine commented 6 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)