We are aiming to design and implement a benchmark that evaluates the analytical performance of time series databases.
Here is my understanding of how to design the analytical workload.
Dataset
In TSBS we have metrics data from DevOps or IoT devices (e.g. CPU/memory utilization). There are other kinds of time series data we can take into consideration:
Events: e.g. users logging in/out of a website, an IoT device turning on/off. Event data typically consists of a timestamp and an event type.
Logs: e.g. log output from servers/applications/databases/network devices. Log data typically consists of a timestamp, log level, log content, and a backtrace.
How to generate the dataset is one of the main concerns.
Deep learning (which I am not familiar with): TSM-Bench takes this approach. It seems more complicated, but the generated data may be more realistic.
Statistical methods: some papers use Hidden Markov Models to generate data, or extract a pattern from real data and generate more data following that pattern. TPC-DS uses synthetic datasets built from well-studied distributions such as the Normal or Poisson distributions; these are mathematically well defined and easy to implement in a data generator.
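To make the statistical option concrete, here is a minimal sketch (assuming NumPy; the baseline shape and all parameters are purely illustrative, not taken from TPC-DS) that generates a metric series from a seasonal baseline plus Normal noise, and an event-count series from a Poisson distribution:

```python
import numpy as np

def generate_metrics(n_points, seed=0):
    """Generate a synthetic CPU-utilization-like series (daily seasonal
    baseline plus Gaussian noise) and a Poisson-distributed event count
    per interval. Illustrative distributions and parameters only."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_points)
    # Daily cycle assuming one point per minute (1440 points per day).
    baseline = 50 + 20 * np.sin(2 * np.pi * t / 1440)
    cpu = np.clip(baseline + rng.normal(0, 5, n_points), 0, 100)
    events = rng.poisson(lam=3, size=n_points)  # e.g. logins per minute
    return cpu, events

cpu, events = generate_metrics(1440)
```

The same generator shape extends to logs by sampling a log level and a message template per event.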
We should have options to control:
Data type: Metrics, Events, Logs
Scaling factor to control data size; both the domain and the tuple count should be scaled (refer to TPC-DS)
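A possible shape for those options, as a sketch (the class and all names/constants are hypothetical, not from TSBS or TPC-DS):

```python
from dataclasses import dataclass

@dataclass
class GeneratorConfig:
    data_type: str     # "metrics" | "events" | "logs"
    scale_factor: int  # 1, 10, 100, ...

    def num_series(self) -> int:
        # Scale the domain: number of distinct hosts/devices.
        return 100 * self.scale_factor

    def points_per_series(self) -> int:
        # Scale the tuple count per series.
        return 10_000 * self.scale_factor

cfg = GeneratorConfig(data_type="metrics", scale_factor=10)
```

Scaling both dimensions (like TPC-DS) matters because cardinality of the domain and length of each series stress different parts of a TSDB.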
Query
I collected some scenarios that exercise the analytical performance of a time series database:
Data fetching: a very basic function, e.g. selecting data by time range with some filters and aggregation
Anomaly detection: detect the existence of abnormal values; this may involve downsampling
Prediction: may involve a sliding window; some TSDBs have built-in or user-defined prediction functions
Trending: downsampling
Value filling: upsampling
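As a sketch of the anomaly-detection scenario, one simple approach (a z-score over downsampled window means; this is not any particular TSDB's built-in function):

```python
import numpy as np

def detect_anomalies(values, window=60, k=3.0):
    """Downsample into fixed windows (mean per window), then flag
    windows whose mean deviates more than k standard deviations
    from the mean of all window means."""
    n = len(values) // window * window
    means = values[:n].reshape(-1, window).mean(axis=1)
    mu, sigma = means.mean(), means.std()
    return np.flatnonzero(np.abs(means - mu) > k * sigma)

series = np.random.default_rng(1).normal(50, 2, 6000)
series[3000:3060] += 40  # inject one anomalous window
```

With a 60-point window the injected span (indices 3000..3059) lands exactly in window 50, which is the one flagged.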
Pseudo code for some queries:
1. Data fetching
SELECT time, id
FROM t
WHERE time > ts_start
AND time < ts_stop
AND a > value
2. Aggregation and Join
SELECT id, AVG(a), SUM(b)
FROM t
WHERE time > ts_start
AND time < ts_stop
GROUP BY id

SELECT t1.id, AVG(t1.a), SUM(t2.b)
FROM t1 JOIN t2 ON t1.id = t2.id AND t1.time = t2.time
WHERE t1.time > ts_start
AND t1.time < ts_stop
GROUP BY t1.id
3. Downsampling
SELECT time, id, AVG(a), SUM(b)
FROM t
WHERE time > ts_start
AND time < ts_stop
GROUP BY id
SAMPLE BY 1h
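For intuition, the downsampling semantics above correspond to a pandas resample (a sketch for a single series; the SQL above is SAMPLE BY-style pseudocode, so this is just one concrete reading of it):

```python
import numpy as np
import pandas as pd

# One day of per-minute readings for a single series.
idx = pd.date_range("2024-01-01", periods=1440, freq="min")
df = pd.DataFrame({"a": np.arange(1440, dtype=float)}, index=idx)

# SAMPLE BY 1h: one aggregated row per hour.
hourly = df.resample("1h").mean()
```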
4. Upsampling
SELECT time, id, a
FROM t
WHERE time > ts_start
AND time < ts_stop
SAMPLE BY 10s
FILL(LINEAR)
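Similarly, the upsampling with linear fill can be sketched with pandas (single series, illustrative values):

```python
import pandas as pd

# Sparse one-minute readings.
idx = pd.date_range("2024-01-01", periods=3, freq="min")
df = pd.DataFrame({"a": [0.0, 60.0, 120.0]}, index=idx)

# SAMPLE BY 10s FILL(LINEAR): add 10-second grid points,
# interpolating linearly between the real readings.
up = df.resample("10s").interpolate(method="linear")
```

Two minutes of data at a 10-second grid yields 13 rows, with the gaps filled as 10.0, 20.0, ... between the original points.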
Test suite
Like TPC-DS, we can have the following tests:
Loading Test: evaluate the time to import raw data into the DB (single/multi thread)
Power Test: single-thread query performance
Throughput Test: multi-thread query performance
Maybe we should also add tests to evaluate ETL performance.
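A minimal harness for the Power/Throughput distinction might look like this (a sketch: `run_query` is a stand-in for a real DB client call, and the query list is a placeholder):

```python
import time
from concurrent.futures import ThreadPoolExecutor

QUERIES = ["q1", "q2", "q3", "q4"] * 5  # 20 placeholder queries per stream

def run_query(q):
    time.sleep(0.01)  # stand-in for db.execute(q)

def power_test():
    """Power Test: a single stream runs every query back to back."""
    start = time.perf_counter()
    for q in QUERIES:
        run_query(q)
    return time.perf_counter() - start

def throughput_test(streams=4):
    """Throughput Test: several concurrent streams, each running the full set."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(power_test) for _ in range(streams)]
        for f in futures:
            f.result()
    return time.perf_counter() - start
```

The Loading Test is the same harness pointed at the import path instead of at queries.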
Outputs
This is not something that can be fully designed at the very beginning. In a nutshell, a benchmark is a tool to measure performance, so the most important raw measurement is the time it takes to execute each query. Once we have the import time and the single-threaded and multi-threaded execution times, we can derive metrics like THROUGHPUT and PRICE OVER PERFORMANCE.
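For instance, TPC-style derived metrics could be computed along these lines (formulas deliberately simplified; the real TPC-DS metric definitions are more involved):

```python
def throughput_qph(num_queries: int, streams: int, elapsed_seconds: float) -> float:
    """Queries per hour across all concurrent streams."""
    return num_queries * streams * 3600.0 / elapsed_seconds

def price_over_performance(system_price: float, qph: float) -> float:
    """Dollars per (query per hour): lower is better."""
    return system_price / qph

# 20 queries x 4 streams completing in one hour -> 80 queries per hour.
qph = throughput_qph(num_queries=20, streams=4, elapsed_seconds=3600.0)
```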