Closed AmosG closed 1 week ago
Sorry, but I think I found that this would work: working-querywise-or-connectionswise
Working querywise or connectionwise: this option sets whether benchmarks are performed per query (one query after the other is completed) or per connection (one connection after the other is completed).
This means processing -w query
is
loop over queries q
loop over connections c
making n benchmarks for q and c
compute statistics
save results
generate reports
and processing -w connection
is
loop over connections c
loop over queries q
making n benchmarks for q and c
compute statistics
save results
generate reports
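The two modes described above only swap the loop order. A minimal Python sketch of that difference (illustrative names, not the actual dbmsbenchmarker internals):

```python
queries = ["q1", "q2"]
connections = ["c1", "c2"]
n = 3  # number of benchmark runs per (query, connection) pair

def benchmark(q, c):
    # stand-in for timing one execution of query q over connection c
    return f"run {q} on {c}"

def process(mode):
    # -w query: queries form the outer loop; -w connection: connections do
    if mode == "query":
        pairs = [(q, c) for q in queries for c in connections]
    else:
        pairs = [(q, c) for c in connections for q in queries]
    results = [benchmark(q, c) for q, c in pairs for _ in range(n)]
    # computing statistics, saving results and generating reports would follow here
    return results
```

Both modes perform the same set of runs; only the order (and therefore the grouping of statistics) differs.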
Hello @AmosG I assume dbmsbenchmarker is not well prepared for your use case. A typical use case would be running a sequence of complicated and parameterized (and slow) queries like in TPC-H and TPC-DS. The result is retrieved, stored and compared. Stress for the DBMS comes from high repetition, not parallel execution.
The querywise or connectionwise parameter itself does not parallelize anything.
However I think dashboards are a valid and interesting use case.
As an example, what does a query look like?
What would you expect dbmsbenchmarker to report after the benchmark? The total throughput?
Dashboard queries (https://www.geckoboard.com/dashboard-examples/executive/sql-dashboard/) are no different from typical OLAP-style queries: aggregate measures grouped by certain dimensions, some trends, top-x ordered lists of certain entities, all of this run for comparisons between two or more time periods.
The only difference is that these queries are fired at once from a single client. IMO, if --numProcesses is quite large, one should get the same numbers, and over a period of time all the dashboard queries would run in parallel across multiple benchmark clients. One wouldn't get stats at a dashboard level, though.
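The "fired at once from a single client" behaviour can be sketched with a thread pool; this is only an illustration of what a large --numProcesses approximates, with a placeholder instead of a real DBMS call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# a hypothetical dashboard: all of its queries are fired at once
dashboard_queries = [
    "SELECT region, SUM(revenue) FROM sales GROUP BY region",
    "SELECT product, COUNT(*) FROM orders GROUP BY product ORDER BY 2 DESC LIMIT 10",
    "SELECT day, AVG(latency_ms) FROM metrics GROUP BY day",
]

def run_query(sql):
    start = time.perf_counter()
    # a real client would execute sql against the DBMS here
    return sql, time.perf_counter() - start

# every dashboard query in flight at the same time, one worker per query
with ThreadPoolExecutor(max_workers=len(dashboard_queries)) as pool:
    timings = dict(pool.map(run_query, dashboard_queries))
```

Per-query timings come back, but as noted, nothing here aggregates them into a per-dashboard statistic.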
not parallel execution.
When you mention "not parallel execution", I presume you meant for a set of queries, not across processes.
Yes, you are right. As an example, here is a config that runs 200 queries in 10 parallel batches of 20 queries each:
{
    'name': 'Demo for parallel queries',
    'connectionmanagement': {
        'timeout': 100,  # in seconds
        'numProcesses': 10,  # number of parallel client processes
        'singleConnection': False,  # allow connection not per stream
        'runsPerConnection': 20,  # reconnect for every 20 executions
    },
    'queries': [
        {
            'title': "Compute random result",
            'query': "SELECT 1+{NUMBER} AS result",
            'numRun': 200,  # total number of runs
            'parameter': {
                'NUMBER': {
                    'type': "integer",
                    'range': [1, 100]
                },
            },
        },
    ]
}
Note the following limitations:
Does this fit your use case?
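For reference, the {NUMBER} placeholder in the config above is filled per run from the configured integer range. A rough sketch of that substitution (my reading of the parameterization, not the actual implementation):

```python
import random

# query template and range as in the example config
query_template = "SELECT 1+{NUMBER} AS result"
number_range = [1, 100]

def instantiate(template, lo, hi):
    # each run substitutes a fresh random integer from the inclusive range
    return template.format(NUMBER=random.randint(lo, hi))

sql = instantiate(query_template, *number_range)
```

Each of the 200 runs therefore executes a slightly different concrete query.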
Thank you, that will work. I have a clarification: is 'runsPerConnection': 20 required? What is the behaviour if it is left out?
It is redundant at this point, but in general it is recommended to specify these parameters completely. If there is a contradiction in parameters about the number of processes, size of batches etc., it cannot be guaranteed that the behaviour is as expected.
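A plausible way to see the redundancy, using the numbers from the example config (my reading of the arithmetic, not an official statement of the internals):

```python
# numbers taken from the example config
num_run = 200             # 'numRun': total executions of the query
num_processes = 10        # 'numProcesses': parallel client processes
runs_per_connection = 20  # 'runsPerConnection': executions before reconnecting

runs_per_process = num_run // num_processes
# each process gets exactly one connection's worth of runs, so the
# reconnect threshold is never exceeded and the setting changes nothing here
redundant = (runs_per_process == runs_per_connection)
```

Change numRun to 400 with the same other settings and each process would reconnect once mid-stream, so the parameter would matter again.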
Given that in day-to-day usage users enter a dashboard that kicks off between 10 and 50 queries per user,
it would be ideal to be able to simulate the same behaviour using the benchmarker.