GreptimeTeam / greptimedb

An open-source, cloud-native, distributed time-series database with PromQL/SQL/Python supported. Available on GreptimeCloud.
https://greptime.com/
Apache License 2.0

Continuous aggregates / downsampling #638

Closed: morigs closed this issue 1 month ago

morigs commented 1 year ago

What problem does the new feature solve?

Eventually, data volume grows so large that it becomes impossible to retrieve data quickly. It also becomes impractical to keep historical data (you have to remove old data to cut costs). Some time-series and analytical databases support downsampling and/or continuous aggregates. For example:

What does the feature do?

It would be awesome to have real-time aggregation capabilities as in Materialize, PipelineDB, TimescaleDB, ClickHouse, and MongoDB. However, this can be really challenging, so perhaps it's simpler to implement downsampling only (as in traditional time-series databases). This should solve most of the problems with data storage. Ideally, it should be possible to query the data without worrying about the downsampling tables (as it works in Thanos), and it should be possible to specify a different retention period (TTL) for each downsampling table.
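For illustration only, here is a minimal sketch of the manual version of what is being asked for: a coarser table with its own retention, kept up to date by a periodic aggregation query that a continuous aggregate would automate. The table names, the `ttl` option, and the `date_bin` function are assumptions for the example, not a confirmed GreptimeDB API.

```sql
-- Hypothetical downsampling setup (names and options are illustrative only).
-- Raw data: full resolution, short retention.
CREATE TABLE cpu_raw (
    host STRING,
    usage DOUBLE,
    ts TIMESTAMP TIME INDEX,
    PRIMARY KEY (host)
) WITH (ttl = '7d');

-- Downsampled data: 5-minute averages, long retention.
CREATE TABLE cpu_5m (
    host STRING,
    avg_usage DOUBLE,
    ts TIMESTAMP TIME INDEX,
    PRIMARY KEY (host)
) WITH (ttl = '365d');

-- The periodic job a continuous aggregate would replace:
INSERT INTO cpu_5m
SELECT host,
       avg(usage) AS avg_usage,
       date_bin(INTERVAL '5 minutes', ts) AS ts
FROM cpu_raw
WHERE ts >= now() - INTERVAL '10 minutes'
GROUP BY host, date_bin(INTERVAL '5 minutes', ts);
```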

Implementation challenges

No response

waynexia commented 1 year ago

Hi @morigs, this is a splendid suggestion, really appreciate it!

Downsampling is definitely a necessary feature for a database like ours, and it's on the mid-term plan because of some dependencies. As you may know, our storage engine is an LSM-Tree-like structure, but so far we have only implemented the L0 part. So to achieve things like downsampling, or the TTL described in #601, we first need the compaction functionality done (the feature-request ticket hasn't been opened yet). But in any case, downsampling will no doubt be included in the roadmap (which is being written and will likely be released next week)!

killme2008 commented 1 year ago

Thanks for your suggestion. Yes, continuous aggregation/downsampling is an important feature of a TSDB, and we should support it too. We want to support it not only via SQL but also via Python; I think it would be really cool if you could create a continuous aggregation/downsampling job in Python.

fredrikIOT commented 6 months ago

Hello, has there been any progress on this feature? Additionally, I'm curious whether it will support backfilling. In my use case, I frequently delete and update a range of historical data based on certain calculations. Will these changes be reflected and re-aggregated automatically in the continuous aggregate jobs?

discord9 commented 6 months ago

> Hello, has there been any progress on this feature? Additionally, I'm curious whether it will support backfilling. In my use case, I frequently delete and update a range of historical data based on certain calculations. Will these changes be reflected and re-aggregated automatically in the continuous aggregate jobs?

Sorry for the delay in implementing this feature. Yes, we are working on a much simpler version of differential dataflow, integrated into our database, that can capture data changes to a certain extent, i.e. update with backfilling. However, we are currently aiming at simple queries composed of avg, sum, count, or min/max, with backfilling support (allowing deletes or changes to historical data in a limited way). If you could elaborate a bit on your use case, it might be useful when we design features for the stream processing system.

fredrikIOT commented 6 months ago

> Sorry for the delay in implementing this feature. Yes, we are working on a much simpler version of differential dataflow, integrated into our database, that can capture data changes to a certain extent, i.e. update with backfilling. However, we are currently aiming at simple queries composed of avg, sum, count, or min/max, with backfilling support (allowing deletes or changes to historical data in a limited way). If you could elaborate a bit on your use case, it might be useful when we design features for the stream processing system.

Thank you for your response. My use case involves working primarily with raw data while maintaining a time-series dataset that is downsampled based on averages over certain time windows. This is where I plan to use continuous aggregates.

The critical aspect of my use case is handling data deletions and updates. My services frequently delete data within specific windows. This deletion can be a complete removal or a replacement of data. When data is replaced, it often involves two scenarios:

  • Replacement with different numerical values.
  • Resampling where data points are shifted, such as a 0.5-second shift.

This is where I need the continuous aggregates to reflect these changes accurately. The ability to handle such dynamic data updates and deletions efficiently in continuous aggregates would be highly beneficial for my application.

I haven't started using GreptimeDB extensively yet, but I'm currently testing it to determine its suitability for my needs. Specifically, I haven't tested the removal and backfilling of data, so I'm interested in knowing whether GreptimeDB supports these operations and whether its compaction process can rearrange backfilled data effectively.
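For concreteness, the workload described above would look roughly like the following in SQL. This is only an illustration: the table and column names are made up, and whether GreptimeDB's DELETE accepts an arbitrary time-range predicate like this is an assumption, not something confirmed in this thread.

```sql
-- Hypothetical delete-and-backfill of one window (illustrative names only).
-- 1. Remove the old data points in the affected window.
DELETE FROM sensor_data
WHERE ts >= '2024-01-01 10:00:00' AND ts < '2024-01-01 11:00:00';

-- 2. Re-insert corrected values, possibly time-shifted (e.g. by 0.5 s).
INSERT INTO sensor_data (device, value, ts) VALUES
    ('dev-1', 21.7, '2024-01-01 10:00:00.500'),
    ('dev-1', 21.9, '2024-01-01 10:00:01.500');

-- A continuous aggregate over sensor_data would then need to
-- re-aggregate the affected window automatically.
```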

discord9 commented 6 months ago

> > Sorry for the delay in implementing this feature. Yes, we are working on a much simpler version of differential dataflow, integrated into our database, that can capture data changes to a certain extent, i.e. update with backfilling. However, we are currently aiming at simple queries composed of avg, sum, count, or min/max, with backfilling support (allowing deletes or changes to historical data in a limited way). If you could elaborate a bit on your use case, it might be useful when we design features for the stream processing system.

> Thank you for your response. My use case involves working primarily with raw data while maintaining a time-series dataset that is downsampled based on averages over certain time windows. This is where I plan to use continuous aggregates.
>
> The critical aspect of my use case is handling data deletions and updates. My services frequently delete data within specific windows. This deletion can be a complete removal or a replacement of data. When data is replaced, it often involves two scenarios:
>
>   • Replacement with different numerical values.
>   • Resampling where data points are shifted, such as a 0.5-second shift.
>
> This is where I need the continuous aggregates to reflect these changes accurately. The ability to handle such dynamic data updates and deletions efficiently in continuous aggregates would be highly beneficial for my application.
>
> I haven't started using GreptimeDB extensively yet, but I'm currently testing it to determine its suitability for my needs. Specifically, I haven't tested the removal and backfilling of data, so I'm interested in knowing whether GreptimeDB supports these operations and whether its compaction process can rearrange backfilled data effectively.

Then I am sorry to say that our feature for this is still under active development and nowhere near usable yet, but your use case is definitely where we are heading with continuous aggregates. In the meantime, I would suggest a database like Materialize for this kind of change data capture (it's really sort of a clever cached materialized view, though, nothing time-series about it).

fredrikIOT commented 6 months ago

> Then I am sorry to say that our feature for this is still under active development and nowhere near usable yet, but your use case is definitely where we are heading with continuous aggregates. In the meantime, I would suggest a database like Materialize for this kind of change data capture (it's really sort of a clever cached materialized view, though, nothing time-series about it).

Thank you for your response and recommendation. I understand that the continuous aggregate feature is still under active development. I am also curious about GreptimeDB's capabilities in its current state, particularly regarding the deletion, replacement, and backfilling of data within specific windows. Will the time series data be compacted properly during these processes (on inserts, not continuous aggregates)? From what I understand, GreptimeDB uses a compactor service (similar to InfluxDB IOx) to optimize time series data for querying. Could you provide some insights into this?

waynexia commented 6 months ago

> I am also curious about GreptimeDB's capabilities in its current state, particularly regarding the deletion, replacement, and backfilling of data within specific windows.

Frequent deletion or duplication (replacement) will impact read performance on uncompacted data, but won't affect write performance. Once compaction has been scheduled on the corresponding data, queries can reach the best performance again, as if there had been no deletion/backfilling.

So if your queries run over freshly modified data (other than newly written data), deletion or backfilling will bring some performance decline. But if most of your queries run over the latest, unmodified data, and the modified time windows are only touched by offline, non-ad-hoc queries, then the impact is very limited.

fredrikIOT commented 6 months ago

> So if your queries run over freshly modified data (other than newly written data), deletion or backfilling will bring some performance decline. But if most of your queries run over the latest, unmodified data, and the modified time windows are only touched by offline, non-ad-hoc queries, then the impact is very limited.

Thanks, that makes sense. My primary concern was whether the modified data would eventually be scheduled for compaction. From your response, it seems that while there might be a temporary performance decline due to deletion or backfilling, the data will indeed be compacted after some time, restoring optimal performance. Have I understood that correctly?

waynexia commented 6 months ago

> > So if your queries run over freshly modified data (other than newly written data), deletion or backfilling will bring some performance decline. But if most of your queries run over the latest, unmodified data, and the modified time windows are only touched by offline, non-ad-hoc queries, then the impact is very limited.

> Thanks, that makes sense. My primary concern was whether the modified data would eventually be scheduled for compaction. From your response, it seems that while there might be a temporary performance decline due to deletion or backfilling, the data will indeed be compacted after some time, restoring optimal performance. Have I understood that correctly?

Precisely 👍

austin-barrington commented 3 months ago

Hey, in InfluxDB this is actually handled by an external program called Kapacitor. Its job is just to query the data, pulling it into the program locally, and then write the results back. This is a good middle ground if anyone needs to run aggregation for the time being.
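As a rough sketch of that middle ground: an external scheduler (cron, a small service, etc.) could periodically run an aggregation query against the raw table and write the result into a coarser table. The table names and the `date_bin` call below are illustrative assumptions, not a documented GreptimeDB recipe.

```sql
-- Hypothetical query an external job could run every 5 minutes
-- (Kapacitor-style read-aggregate-write-back; names are illustrative).
INSERT INTO http_requests_5m
SELECT service,
       count(*)     AS request_count,
       avg(latency) AS avg_latency,
       date_bin(INTERVAL '5 minutes', ts) AS ts
FROM http_requests
WHERE ts >= now() - INTERVAL '5 minutes'
GROUP BY service, date_bin(INTERVAL '5 minutes', ts);
```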

discord9 commented 3 months ago

> Hey, in InfluxDB this is actually handled by an external program called Kapacitor. Its job is just to query the data, pulling it into the program locally, and then write the results back. This is a good middle ground if anyone needs to run aggregation for the time being.

We are developing our dataflow module in a similar way, but instead of inter-process communication we chose to do it in an inter-thread way: the dataflow module stays with the datanodes, which reduces some ser/de cost (which is also what Kapacitor does in InfluxDB 2.0).

killme2008 commented 1 month ago

It's tracked in #3187, and the flow engine was released in v0.8.
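For readers landing here later, a minimal sketch of what a continuous downsampling job looks like with the flow engine shipped in v0.8. The table and column names are made up, and the exact options should be verified against the current GreptimeDB flow documentation.

```sql
-- Illustrative continuous downsampling with the v0.8 flow engine
-- (cpu_raw/cpu_5m are hypothetical; the sink table is assumed to
--  already exist with a matching schema).
CREATE FLOW IF NOT EXISTS cpu_downsample
SINK TO cpu_5m
AS
SELECT
    host,
    avg(usage) AS avg_usage,
    date_bin(INTERVAL '5 minutes', ts) AS time_window
FROM cpu_raw
GROUP BY host, time_window;
```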