GreptimeTeam / greptime-bench

Time series workload benchmark suite

Discussion: Benchmark for time series analytical databases #2

Closed Dysprosium0626 closed 3 months ago

Dysprosium0626 commented 3 months ago

We aim to design and implement a benchmark that evaluates the analytical performance of time series databases.

Here is my understanding of how to design an analytical workload:

Dataset

In TSBS we have metrics data from DevOps or IoT devices (e.g. CPU/memory utilization), but there are other kinds of time series data we can take into consideration:

  1. Events: e.g. users logging in/out of websites, IoT devices turning on/off. Event data typically consists of timestamps and an event type.
  2. Logs: e.g. log content from server/application/database/network devices. Log data typically consists of timestamps, log level, log content, and a backtrace.

How to generate the dataset is one of the main concerns:

  1. Deep learning (which I am not familiar with): refer to TSM-Bench. It seems more complicated, but the generated data may be more realistic.
  2. Statistical methods: some papers use Hidden Markov Models to generate data, or extract a data pattern and generate more data following that pattern. TPC-DS uses synthetic datasets built on well-studied distributions such as the Normal or Poisson distribution; these are mathematically well defined and easy to implement in a data generator.
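To make the statistical approach concrete, here is a minimal sketch of a generator based on a two-state Markov chain (normal load vs. spike) that emits Normally distributed CPU-utilization values. The states, transition probabilities, and distribution parameters are all made-up illustrative assumptions, not taken from any of the referenced papers:

```python
import random

# Hypothetical 2-state Markov chain: state 0 = normal load, state 1 = spike.
# Transition and emission parameters are illustrative only.
STAY_PROB = {0: 0.95, 1: 0.70}                 # probability of staying in the state
EMISSIONS = {0: (40.0, 5.0), 1: (90.0, 3.0)}   # (mean, stddev) of CPU% per state

def generate_cpu_series(n, seed=42):
    """Generate n CPU-utilization samples from the Markov chain."""
    rng = random.Random(seed)
    state, series = 0, []
    for _ in range(n):
        mean, stddev = EMISSIONS[state]
        # Clamp samples to the valid CPU-utilization range [0, 100].
        series.append(min(100.0, max(0.0, rng.gauss(mean, stddev))))
        if rng.random() > STAY_PROB[state]:
            state = 1 - state  # switch between normal-load and spike states
    return series

values = generate_cpu_series(1000)
```

Seeding the generator keeps runs reproducible, which matters for a benchmark: every system under test should ingest the same data.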

We should have options to control:

  1. Data type: Metrics, Event, Logs
  2. Scaling factor to control data size. Both domain and tuple counts should be scaled (refer to TPC-DS).
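The TPC-DS idea of scaling both domains and tuples could translate into something like the following sketch. The base sizes and the sub-linear domain-growth exponent are arbitrary choices for illustration:

```python
def scaled_sizes(scale_factor):
    """Scale both the domain (distinct hosts) and the tuples (rows per
    host), in the spirit of TPC-DS dimension/fact scaling.

    The exponents are illustrative: domains often grow sub-linearly with
    the scale factor while row counts grow linearly."""
    base_hosts, base_rows = 100, 10_000
    hosts = round(base_hosts * scale_factor ** 0.5)  # domain: sub-linear
    rows_per_host = base_rows * scale_factor         # tuples: linear
    return hosts, rows_per_host

print(scaled_sizes(1))    # → (100, 10000)
print(scaled_sizes(100))  # → (1000, 1000000)
```

Scaling the domain too keeps cardinalities realistic: a 100x larger dataset should also have more distinct series, not just longer ones.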

Query

I collected some scenarios that exercise analytical performance in a time series database:

  1. Data fetching: a very basic function, e.g. selecting data by time range with some filters and aggregation.
  2. Anomaly detection: detect the existence of abnormal values. This may involve downsampling.
  3. Prediction: may involve a sliding window. Some TSDBs have customized prediction functions or support user-defined prediction functions.
  4. Trending: downsampling.
  5. Value filling: upsampling.
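For the anomaly-detection scenario, a simple baseline that a benchmark query could be checked against is z-score detection over downsampled windows. This is only a sketch; the window size and threshold are arbitrary:

```python
import statistics

def detect_anomalies(values, window=60, threshold=2.5):
    """Downsample raw samples into fixed-size window averages, then flag
    windows deviating from the overall mean by more than `threshold`
    standard deviations."""
    # Downsampling step: average each window of `window` raw samples.
    buckets = [sum(values[i:i + window]) / len(values[i:i + window])
               for i in range(0, len(values), window)]
    mean = statistics.mean(buckets)
    stdev = statistics.stdev(buckets)
    return [i for i, b in enumerate(buckets)
            if stdev > 0 and abs(b - mean) / stdev > threshold]

# Example: a flat series with one spiked window of 60 samples.
data = [10.0] * 600
data[300:360] = [100.0] * 60
print(detect_anomalies(data))  # → [5]
```

A real benchmark would push the downsampling into the database (as in the SAMPLE BY queries below) and only verify the flagged windows client-side.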

Pseudo code for some queries:

  1. Data fetching

    SELECT time, id 
    FROM t 
    WHERE time > ts_start 
    AND time < ts_stop
    AND a > value

  2. Aggregation and join

    SELECT time, id, AVG(a), SUM(b)
    FROM t
    WHERE time > ts_start
    AND time < ts_stop
    GROUP BY time, id

    SELECT t1.time, t1.id, AVG(t1.a), SUM(t2.b)
    FROM t1 JOIN t2 ON t1.id = t2.id
    WHERE t1.time > ts_start
    AND t1.time < ts_stop
    GROUP BY t1.time, t1.id
  3. Downsampling

    SELECT time, id, AVG(a), SUM(b)
    FROM t
    WHERE time > ts_start
    AND time < ts_stop
    SAMPLE BY 1h
  4. Upsampling

    SELECT time, id, a
    FROM t
    WHERE time > ts_start
    AND time < ts_stop
    SAMPLE BY 10s
    FILL(LINEAR)
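What FILL(LINEAR) has to do on the engine side can be sketched in a few lines. This is illustrative only: real engines interpolate per series; here a single series of (timestamp, value) points is resampled to a fixed step:

```python
def upsample_linear(points, step):
    """Resample (ts, value) pairs to a fixed step, linearly interpolating
    the gaps — roughly what FILL(LINEAR) produces."""
    points = sorted(points)
    out, (t0, v0) = [], points[0]
    for t1, v1 in points[1:]:
        t = t0
        while t < t1:
            # Linear interpolation between the two surrounding samples.
            frac = (t - t0) / (t1 - t0)
            out.append((t, v0 + frac * (v1 - v0)))
            t += step
        t0, v0 = t1, v1
    out.append((t0, v0))
    return out

# 40 s gap in the raw data resampled to 10 s steps:
print(upsample_linear([(0, 0.0), (40, 4.0)], 10))
# → [(0, 0.0), (10, 1.0), (20, 2.0), (30, 3.0), (40, 4.0)]
```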

Test suite

Like TPC-DS we can have the following tests:

  1. Loading Test: evaluate the time to import raw data into the DB (single/multi-thread).
  2. Power Test: single-thread query performance.
  3. Throughput Test: multi-thread query performance.

Maybe we should also add some tests for ETL performance evaluation.

Outputs

This is not something that can be fully designed at the very beginning. In a nutshell, a benchmark is a tool to measure performance, so the most important output is the time it takes to execute the queries. Once we have the import, single-threaded, and multi-threaded execution times, we can derive metrics like throughput and price over performance.
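A sketch of how the summary metrics could be derived from the measured times. The formulas follow the spirit of TPC-DS; the names and the pricing input are assumptions, not a fixed design:

```python
def benchmark_metrics(load_seconds, power_seconds, throughput_seconds,
                      num_queries, num_streams, system_price):
    """Derive summary metrics from measured execution times.

    - power_qph: single-stream queries per hour (Power Test)
    - throughput_qph: multi-stream queries per hour (Throughput Test)
    - price_per_qph: dollars per unit of throughput
    """
    power = num_queries * 3600 / power_seconds
    throughput = num_queries * num_streams * 3600 / throughput_seconds
    return {
        "load_seconds": load_seconds,
        "power_qph": power,
        "throughput_qph": throughput,
        "price_per_qph": system_price / throughput,
    }

m = benchmark_metrics(load_seconds=120, power_seconds=60,
                      throughput_seconds=300, num_queries=10,
                      num_streams=4, system_price=1000.0)
# power: 10 queries in 60 s  → 600 queries/hour
# throughput: 40 queries in 300 s → 480 queries/hour
```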

References:

  1. TSM-Bench: Benchmarking Time Series Database Systems for Monitoring Applications: https://www.vldb.org/pvldb/vol16/p3363-khelifati.pdf, https://www.odbms.org/2023/12/on-benchmarking-time-series-database-systems-for-monitoring-applications-qa-with-abdelouahab-khelifati-and-mourad/
  2. SciTS: A Benchmark for Time-Series Databases in Scientific Experiments and Industrial Internet of Things: https://arxiv.org/pdf/2204.09795
  3. YCSB-TS: https://github.com/TSDBBench/YCSB-TS
  4. TS-Benchmark: A Benchmark for Time Series Databases: https://www.benchcouncil.org/bench2018/chenyueguo.pdf
  5. The Making of TPC-DS: https://www.tpc.org/tpcds/presentations/the_making_of_tpcds.pdf