itisacloud commented 12 months ago

Description

We need a Python script to automate benchmarking of ClickHouse DB performance by running a predefined list of queries against it. Currently, we perform these tests manually with Jochen Stier and Johannes Visintini. Jochen has a method to run queries without using cached files, and we anticipate these queries to be slow.

Once the files are loaded into RAM, we expect the queries against ClickHouse to be much faster. Therefore, the script should provide timing information for these queries, considering different conditions such as whether the files are in RAM or not. For example, we have observed that the number of CPUs impacts the aggregation speed, so the script should capture this information as well.

It would be ideal if the script could output the results in a table or HTML page format, displaying metrics such as minimum, maximum, mean, and median for each query/endpoint under different conditions (files in RAM vs. files not in RAM).
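A minimal sketch of how the timing and summary part could look. It assumes a hypothetical `run_query` callable that executes one query against ClickHouse; the client library and the method for bypassing cached files are deliberately left open. The names `time_query`, `summarize`, and `format_table` are illustrative and do not refer to an existing script.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple

METRICS = ("min", "max", "mean", "median")


def time_query(run_query: Callable[[], object], repetitions: int = 5) -> List[float]:
    """Run the query `repetitions` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # hypothetical: executes one query against ClickHouse
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce a list of durations to the metrics requested in this issue."""
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
    }


def format_table(results: Dict[Tuple[str, str], Dict[str, float]]) -> str:
    """Render one row per (query, condition) pair as a plain-text table."""
    header = f"{'query':<24}{'condition':<20}" + "".join(f"{m:>12}" for m in METRICS)
    lines = [header, "-" * len(header)]
    for (query, condition), stats in results.items():
        row = f"{query:<24}{condition:<20}" + "".join(f"{stats[m]:>12.4f}" for m in METRICS)
        lines.append(row)
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder workload; a real run would call ClickHouse here instead.
    def fake_query() -> None:
        time.sleep(0.01)

    results = {
        ("stats", "files not in RAM"): summarize(time_query(fake_query)),
        ("stats", "files in RAM"): summarize(time_query(fake_query)),
    }
    print(format_table(results))
```

Keeping the query behind a plain callable means the same timing code can be reused for each condition (files in RAM vs. files not in RAM, different CPU counts) by swapping only the setup around it.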

Additionally, this benchmarking tool will be valuable in the future to assess the impact of constantly inserting new data on query performance. Although we don't anticipate any issues, it would be beneficial to have an easy way to test this scenario.
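One way to test the insert scenario could be to time queries while a background thread keeps writing data. The sketch below assumes two hypothetical callables, `run_query` and `insert_batch`, which stand in for the real ClickHouse calls; the function name and the `insert_interval` parameter are illustrative only.

```python
import threading
import time
from typing import Callable, List


def time_under_inserts(
    run_query: Callable[[], object],
    insert_batch: Callable[[], object],
    repetitions: int = 5,
    insert_interval: float = 0.01,
) -> List[float]:
    """Time `run_query` while `insert_batch` is called repeatedly in the background."""
    stop = threading.Event()

    def writer() -> None:
        # Keep inserting until the measurement loop signals it is finished.
        while not stop.is_set():
            insert_batch()  # hypothetical: writes one batch of new rows
            stop.wait(insert_interval)

    thread = threading.Thread(target=writer, daemon=True)
    thread.start()
    durations = []
    try:
        for _ in range(repetitions):
            start = time.perf_counter()
            run_query()  # hypothetical: executes one query against ClickHouse
            durations.append(time.perf_counter() - start)
    finally:
        # Always stop the writer, even if a query raises.
        stop.set()
        thread.join()
    return durations
```

Comparing these durations with a run that has no background writer would show whether constant inserts affect query performance.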

Please investigate if there are any existing frameworks that already support this type of analysis.

Desired Features:

Automation of benchmarking queries against ClickHouse DB.
Support for running queries without cached files (slow queries).
Measurement of query execution time with files in RAM.
Comparison of query performance under different conditions (files in RAM vs. files not in RAM).
Output generation in the form of a table or HTML page.
Inclusion of metrics (min, max, mean, median) for each query/endpoint.
Capability to assess the impact of constantly inserting new data on query performance.
Investigation of existing frameworks that support similar analysis.

Additional Information

Manual benchmarking has been performed with Jochen Stier and Johannes Visintini. Jochen Stier has a method to run queries without cached files, which are expected to be slow. Queries against ClickHouse are expected to be faster when files are loaded in RAM. Aggregation speed has been shown to depend on the number of CPUs. The script's output should make it easy to test the impact of new data insertion on query performance. Existing frameworks that support comparable analysis should be researched.

Hagellach37 commented 4 months ago

@ElJocho @mmerdes what should we do with this issue? There is a script, but it's probably outdated. I think it might be best to remove it again and then close this issue?

ElJocho commented 4 months ago

Hey @Hagellach37, I updated the Gatling scripts recently to also include queries to the topics endpoints, so we can close it, because it is done :)

The small scripts should probably be deleted, because they are not used.

GIScience / ohsome-now-stats-service

add simple Benchmarking to evaluate performance. #23
