asfimport opened 2 years ago
Adrien Grand (@jpountz) (migrated from JIRA)
I know that the Elasticsearch team is looking into doing things like that, but on top of Lucene: creating another index that has a different granularity, instead of having different granularities within the same index and relying on background merges for rollups.
At first sight, doing it within the same index feels a bit scary to me:
What would you think about doing it on top of Lucene instead, e.g. similarly to how the faceting module maintains a side-car taxonomy index, maybe one could maintain a side-car rollup index to speed up aggregations?
Suhan Mao (migrated from JIRA)
@jpountz Thanks for your reply!
As far as I know, the current rollup implementation in ES periodically runs a composite aggregation and inserts the aggregated results into another index.
But this approach has several disadvantages:
To answer your questions
How we can start from scratch
I think we can start with a sidecar solution first. Assume that index A is the index storing raw data, and index A' is a sidecar index that is continuously rolled up.
Assume that the schema of index A is:
```
d0 time, d1 long, d2 keyword, m1 long, m2 long, m3 binary(hll), x1, x2, x3, ...
```
The x1, x2 and x3 fields are not related to rollup; they are just additional normal fields.
d0 is the event time, d1 and d2 are all dimensions and m1, m2 and m3 are all metrics.
If we want to roll up the data to hourly granularity, we can create a rollup sidecar index A' which only contains the d0, d1, d2, m1, m2, m3 fields and performs rollup during the merge process. Users can query A or A' accordingly.
What's more, we can create several rollup indices, which are often called "materialized views" in OLAP scenarios.
For example, if we need another view that only stores d0, d1 and m3 with daily rollup granularity, we can create an additional sidecar index A''.
Users only need to write raw data once, to index A, and all the rollup calculation is performed inside Lucene. Users then query the appropriate level of index.
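To make the shape of this concrete, here is a minimal sketch of the metadata such a view definition could carry (time granularity, dimensions, metrics). The `RollupConfig` type and everything in it are hypothetical, not an existing Lucene API:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical description of a sidecar rollup index ("materialized view").
// This class does not exist in Lucene; it only illustrates the metadata a
// rollup view would need: which fields are dimensions, which are metrics,
// and the granularity used to truncate the event-time dimension d0.
record RollupConfig(
    String name,                 // name of the sidecar index, e.g. "A-hourly"
    String timeField,            // event-time dimension (d0)
    TimeUnit granularity,        // bucket width: HOURS for A', DAYS for A''
    List<String> dimensions,     // group-by fields (d1, d2, ...)
    List<String> metrics) {      // aggregated fields (m1, m2, m3)

  public static void main(String[] args) {
    // A': hourly view over all dimensions and metrics.
    RollupConfig hourly = new RollupConfig(
        "A-hourly", "d0", TimeUnit.HOURS,
        List.of("d1", "d2"), List.of("m1", "m2", "m3"));
    // A'': daily view that keeps only d0, d1 and the HLL metric m3.
    RollupConfig daily = new RollupConfig(
        "A-daily", "d0", TimeUnit.DAYS,
        List.of("d1"), List.of("m3"));
    System.out.println(hourly + "\n" + daily);
  }
}
```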
What do you think?
Suhan Mao (migrated from JIRA)
@jpountz Sorry to interrupt you, but could you share your opinion on this rollup feature? Is it worth moving forward, or do we need further discussion?
Adrien Grand (@jpountz) (migrated from JIRA)
Thanks, I understand better now. With the sidecar approach, could you compute rollups at index time by performing updates instead of hooking into the merging process? For instance, if a user is adding a new sample, you could retrieve data for the current <your-data-granularity-goes-here> bucket for the given dimensions and update the min/max/sum values?
Suhan Mao (migrated from JIRA)
@jpountz Sorry for the late reply, and thanks for your suggestion. I understand that computing rollups at index time is easy to implement, but there are still some drawbacks that should be taken into consideration.
It will slow down indexing performance because, compared to an append-only write, each incoming sample needs to (see the sketch after this list):

1. invoke a term query to retrieve all the fields of the existing doc
2. compute the rollup logic and save the result to new fields
3. delete the original doc
4. index the new doc
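To illustrate why this costs more than an append-only write, here is a minimal sketch of the read-modify-write loop against plain Lucene APIs. The `bucket_id`/`m1_max` schema is made up for the example, error handling and concurrency control are omitted, and the per-sample NRT reopen shown here is itself part of the cost:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Sketch of index-time rollup via read-modify-write. "bucket_id" uniquely
// identifies an (hour, dimensions) bucket; "m1_max" is one rolled-up metric.
class IndexTimeRollup {
  static void addSample(IndexWriter writer, String bucketId, long m1)
      throws IOException {
    long max = m1;
    // 1. Term query to retrieve the existing bucket doc, if any.
    try (DirectoryReader reader = DirectoryReader.open(writer)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs hits =
          searcher.search(new TermQuery(new Term("bucket_id", bucketId)), 1);
      if (hits.totalHits.value > 0) {
        Document existing = searcher.doc(hits.scoreDocs[0].doc);
        // 2. Apply the rollup logic (here: max) to the stored metric.
        max = Math.max(max,
            existing.getField("m1_max").numericValue().longValue());
      }
    }
    // 3 + 4. updateDocument deletes the old doc by term and indexes the
    // new one in a single call.
    Document doc = new Document();
    doc.add(new StringField("bucket_id", bucketId, Field.Store.YES));
    doc.add(new StoredField("m1_max", max));
    doc.add(new NumericDocValuesField("m1_max", max));
    writer.updateDocument(new Term("bucket_id", bucketId), doc);
  }
}
```

Note that `IndexWriter.updateDocument` already folds steps 3 and 4 into one call; the extra reader reopen and term lookup per sample are what hurt ingestion throughput.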
I think the biggest concern is that you do not want to change the Lucene merge semantics, so I have come up with a new approach to doing rollup.
What do you think?
Currently, many OLAP engines support a rollup feature, e.g. ClickHouse (AggregatingMergeTree) and Druid.
Rollup definition: https://athena.ecs.csus.edu/~mei/olap/OLAPoperations.php
One way to do rollup is to merge docs whose dimension values are all the same into one and apply sum()/min()/max() to the metric fields during the segment compaction/merge process. This can significantly reduce the size of the data and speed up queries a lot.
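As a toy illustration of that reduction step (not Lucene code), each metric carries a declared combine function that is applied when two docs fall into the same bucket; the metric names and function assignments below are made up:

```java
import java.util.Map;
import java.util.function.LongBinaryOperator;

// Toy illustration of the merge-time reduction: every metric declares how
// two values from the same bucket are combined into one.
class MetricReducers {
  static final Map<String, LongBinaryOperator> REDUCERS = Map.of(
      "requests", Long::sum,   // sum()
      "min_temp", Math::min,   // min()
      "max_temp", Math::max);  // max()

  static long reduce(String metric, long a, long b) {
    return REDUCERS.get(metric).applyAsLong(a, b);
  }

  public static void main(String[] args) {
    System.out.println(reduce("max_temp", 21, 27)); // -> 27
  }
}
```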
Abstraction of how to do it
Assume the scenario
We use ES to ingest real-time raw temperature data every minute from each sensor device, along with many dimension fields. Users may want to query the data like "what is the max temperature of some device within some/the latest hour" or "what is the max temperature of some city within some/the latest hour".
In that way, we can define the fields and the rollup definition as follows:
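The concrete definition block did not survive the JIRA migration, so the following is only a hedged reconstruction of its likely shape for the sensor scenario; every field and function name is illustrative, not an actual Lucene API:

```java
import java.util.List;

// Hedged reconstruction of the lost field/rollup definition for the
// sensor example; all names here are illustrative only.
record SensorRollupDef(
    String timeField,        // event time, truncated to the rollup unit
    String granularity,      // "hour" in this example
    List<String> dimensions, // group-by fields
    String metric,           // rolled-up field
    String function) {       // how the metric is combined within a bucket

  public static void main(String[] args) {
    SensorRollupDef def = new SensorRollupDef(
        "timestamp", "hour", List.of("device_id", "city"),
        "temperature", "max");
    System.out.println(def);
  }
}
```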
The raw data will periodically be rolled up to hour granularity during the segment merge process, which should ideally save 60x storage in the end (one doc per hour instead of one per minute).
How we do rollup in segment merge
bucket: docs should belong to the same bucket if the dimension values are all the same.
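A minimal sketch of what "same bucket" could mean in code, assuming the event time is truncated to the rollup granularity before being combined with the other dimension values (all names hypothetical):

```java
import java.util.concurrent.TimeUnit;

// Sketch of a bucket key: docs whose truncated timestamp and dimension
// values are all equal share a key and get merged into one rollup doc.
class BucketKey {
  static String keyOf(long epochMillis, String deviceId, String city) {
    // Truncate the event time to the hour so that all samples from the
    // same hour land in the same bucket.
    long hourBucket = TimeUnit.MILLISECONDS.toHours(epochMillis);
    return hourBucket + "|" + deviceId + "|" + city;
  }

  public static void main(String[] args) {
    // Two samples ten minutes apart, same device and city -> same bucket.
    System.out.println(keyOf(1_650_000_000_000L, "sensor-1", "Paris"));
    System.out.println(keyOf(1_650_000_600_000L, "sensor-1", "Paris"));
  }
}
```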
How to define the logic
I have written the initial code at a basic level. I can submit a complete PR if you think this feature is worth trying.
Migrated from LUCENE-10427 by Suhan Mao, updated Mar 29 2022