elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Sparse index for tsdb #95701

Open martijnvg opened 1 year ago

martijnvg commented 1 year ago

Introduce a sparse index data structure in Elasticsearch for tsdb. With tsdb, indices are sorted by the _tsid field (which is composed of the dimension fields) and then by the @timestamp field. A sparse index can be used to easily navigate to other documents of a time series that matched the main query. Conceptually, the key of the sparse index would be the tsid and the value would be a docid range. This would be a per-segment data structure. The docid range can be used to immediately navigate to a specific document or to advance to a document of the next time series.
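To make the conceptual shape concrete, here is a minimal sketch of a per-segment mapping from tsid to docid range. It is an illustration only, not Elasticsearch or Lucene code: the names `DocRange`, `build_tsid_ranges` and `next_series_start` are hypothetical, tsids are modelled as plain strings, and the segment is modelled as a list of tsids in docid order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DocRange:
    first: int  # first docid of the time series in this segment (inclusive)
    last: int   # last docid of the time series in this segment (inclusive)


def build_tsid_ranges(tsids_in_doc_order: List[str]) -> Dict[str, DocRange]:
    """Map each tsid to its docid range, assuming docs are sorted by tsid."""
    ranges: Dict[str, DocRange] = {}
    for docid, tsid in enumerate(tsids_in_doc_order):
        if tsid in ranges:
            # Same series continues: extend the range to the current docid.
            ranges[tsid] = DocRange(ranges[tsid].first, docid)
        else:
            ranges[tsid] = DocRange(docid, docid)
    return ranges


def next_series_start(ranges: Dict[str, DocRange], tsid: str, max_doc: int) -> Optional[int]:
    """Docid of the first document of the next time series, or None at segment end."""
    nxt = ranges[tsid].last + 1
    return nxt if nxt < max_doc else None


# Hypothetical segment with three time series, already sorted by tsid.
segment = ["a", "a", "a", "b", "c", "c"]
ranges = build_tsid_ranges(segment)
```

With such a range, a collector that has matched one document of series `"a"` could jump straight to either end of the range, or skip to `next_series_start(...)` without visiting the documents in between.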

With a sparse index, the following tsdb aggregation features can be implemented efficiently: last value, interpolation and geo fencing. Other tsdb aggregation functionality, like the rate aggregation on a counter field, would benefit as well. Today the rate on counter fields is computed from only the documents that match the main query, but in order to accurately compute the rate of a counter field, all documents of the time series need to be evaluated. An example is filtering on a certain counter threshold and then computing the rate on that same counter field.

Currently a sparse index doesn't exist in either Elasticsearch or Lucene. The idea is to build an in-memory sparse index in Elasticsearch, which can be replaced when a sparse index becomes available in Lucene. The in-memory data structure would only read every Nth document and load the corresponding _tsid. This data structure will not hold all tsids and their corresponding document ranges in memory, but with the tsid of every Nth document it can compute the document range of the requested tsid with a binary search at query time.
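One way the sampling plus binary search could look, as a rough sketch rather than a proposed implementation. Everything here is an assumption for illustration: the class name `SampledTsidIndex`, the `read_tsid` callback (standing in for loading the _tsid of a docid from the segment), the `step` parameter (the N above), and tsids modelled as comparable strings.

```python
from bisect import bisect_left, bisect_right
from typing import Callable, Optional, Tuple


class SampledTsidIndex:
    """Keeps only the tsid of every `step`-th doc of a tsid-sorted segment."""

    def __init__(self, read_tsid: Callable[[int], str], max_doc: int, step: int):
        self.read_tsid = read_tsid
        self.max_doc = max_doc
        self.sample_docs = list(range(0, max_doc, step))
        self.sample_tsids = [read_tsid(d) for d in self.sample_docs]

    def _first_doc_where(self, pred: Callable[[str], bool], lo: int, hi: int) -> int:
        # Binary search in [lo, hi) for the first doc whose tsid satisfies
        # pred; pred must be False for a prefix and True afterwards.
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(self.read_tsid(mid)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    def doc_range(self, tsid: str) -> Optional[Tuple[int, int]]:
        """Inclusive (first, last) docid range of tsid, or None if absent."""
        # Step 1: narrow the search window using only the in-memory samples.
        i = bisect_left(self.sample_tsids, tsid)   # first sample >= tsid
        j = bisect_right(self.sample_tsids, tsid)  # first sample > tsid
        lo = self.sample_docs[i - 1] if i > 0 else 0
        hi = self.sample_docs[j] if j < len(self.sample_docs) else self.max_doc
        # Step 2: binary search inside the window, reading tsids on demand.
        first = self._first_doc_where(lambda t: t >= tsid, lo, hi)
        if first >= self.max_doc or self.read_tsid(first) != tsid:
            return None
        end = self._first_doc_where(lambda t: t > tsid, first, hi)
        return (first, end - 1)


# Hypothetical segment of 16 docs sorted by tsid; sample every 4th doc.
segment = ["a"] * 5 + ["b"] + ["c"] * 7 + ["e"] * 3
index = SampledTsidIndex(segment.__getitem__, len(segment), step=4)
```

Here the samples are the tsids of docs 0, 4, 8 and 12, so memory grows with the number of sampled documents rather than with the number of time series. A series that falls entirely between two samples, such as `"b"`, is still found because the second step reads tsids from the segment inside the narrowed window.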

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

felixbarny commented 1 year ago

Will this allow us to avoid creating an entry in the inverted index per individual data point/metric value so that we only index each time series rather than each data point in each time series?