citusdata / test-automation

Tools for making our tests easier to run

Periodic performance test requirements #124

Open onderkalaci opened 4 years ago

onderkalaci commented 4 years ago

With #123, we have a pretty solid infrastructure to automate this. However, running the tests automatically is not useful enough on its own; we should consider several other things as well.

For instance, network latency characteristics between two different clusters are almost always very different. So, we should definitely make sure to run the two selected versions of Citus on the same cluster.

Other than that, a few months back we compiled a list of requirements that we'd love to have in an automated performance test. I'm copy-pasting it here. Once we prioritize implementing this, we should consider covering some of the items below:

Tracking Citus' Performance on Azure

Immediate requirements: Get the performance automation in a useful state (1-2 weeks)

1.1  Make sure that the results are continuously saved to the database

The Citus PgBench pipeline spins up a new Citus cluster and runs pgbench.

The tool is expected to save the results to a database (psql -h 40.80.217.255 -U citus postgres). However, the latest result was saved on 2019-03-14 21:23:00.037881, which is almost a month ago.

@Paul Epperson mentioned that there are a few issues related to the CI infrastructure, and they are about to fix them. We should follow up on that. Paul is the primary contact person.
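
As a concrete target for what "saved to the database" could look like, here is a minimal sketch in Python; the pgbench_results table and its columns are assumptions, not the pipeline's actual schema:

```python
# Minimal sketch of persisting one pgbench run; the pgbench_results table and
# its columns are hypothetical, not the schema the pipeline actually uses.
import datetime

import psycopg2  # pip install psycopg2-binary


def save_result(tps, citus_version, pgbench_args):
    conn = psycopg2.connect(host="40.80.217.255", user="citus", dbname="postgres")
    with conn, conn.cursor() as cur:  # "with conn" commits on success
        cur.execute(
            """
            INSERT INTO pgbench_results (run_at, citus_version, pgbench_args, tps)
            VALUES (%s, %s, %s, %s)
            """,
            (datetime.datetime.utcnow(), citus_version, pgbench_args, tps),
        )
    conn.close()
```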

1.2  Save the latencies of pgbench tests along with the tps

As mentioned in (1.1), the results of each test are saved to a database. This is very useful, but it is missing one important piece of information: the latencies.

As we’ve examined the existing results, we see some fluctuations in the query tps (transactions per second). That might very well be related to the unique latency characteristics of the cluster. So, without latencies it becomes tricky to understand the fluctuations.

Paul mentioned to me that he has started something like this, so let's talk to him before doing anything.
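
For reference, both numbers are printed in pgbench's stdout, so a small parser is enough to capture them; the exact wording varies slightly across PostgreSQL versions, hence the permissive regexes in this sketch:

```python
# Sketch of extracting both tps and average latency from pgbench's stdout.
# pgbench prints lines like "latency average = 1.234 ms" and
# "tps = 1234.567890 (...)"; the phrasing differs slightly across versions.
import re


def parse_pgbench_output(output):
    latency = re.search(r"latency average = ([\d.]+) ms", output)
    tps = re.search(r"tps = ([\d.]+)", output)
    return (
        float(tps.group(1)) if tps else None,
        float(latency.group(1)) if latency else None,
    )
```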

1.3  Run the tests automatically

As far as I understand, as of today the performance tool is triggered manually. In addition to that, is it possible to run it automatically, say daily or weekly?

Right now, running the tool is left to the developers' initiative, which is a bad idea going forward because it is very likely to be forgotten.
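
A cron entry on a long-lived VM would be the simplest trigger; as a language-level alternative, here is a sketch using the schedule library, where run_perf_pipeline() stands in for whatever entry point the pipeline exposes:

```python
# Sketch of a daily trigger; run_perf_pipeline() is a placeholder for the
# pipeline's real entry point, not an existing function in this repo.
import time

import schedule  # pip install schedule


def run_perf_pipeline():
    ...  # spin up the cluster, run pgbench, save the results


schedule.every().day.at("02:00").do(run_perf_pipeline)

while True:
    schedule.run_pending()
    time.sleep(60)
```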

Short-term goals: Improve the experience (2-3 weeks)

2.1  Tune the pgbench parameters
Tune them such that we hit 100% CPU utilization while running select-only. Similarly, tune the settings so that we get the highest tps for the given cluster. (This might already be the case, but it is worth double-checking.)
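
One way to double-check is a brute-force sweep over client/thread counts, keeping whichever combination yields the highest tps; the host name, run duration, and candidate values below are assumptions, and parse_pgbench_output() is the parser sketched in (1.2):

```python
# Sketch of a pgbench parameter sweep; reuses parse_pgbench_output() from the
# (1.2) sketch. Host, duration, and candidate values are assumptions.
import subprocess


def run_select_only(clients, jobs, host="coordinator", seconds=60):
    result = subprocess.run(
        ["pgbench", "-h", host, "-S",          # -S: built-in select-only script
         "-c", str(clients), "-j", str(jobs),
         "-T", str(seconds), "postgres"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


best = max(
    ((c, j) for c in (16, 32, 64, 128) for j in (4, 8, 16)),
    key=lambda cj: parse_pgbench_output(run_select_only(*cj))[0] or 0.0,
)
print("best (clients, jobs):", best)
```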

2.2  Notify on failures
On a failure to run the tests (or a failure in any step of the process), send a notification e-mail (to citus on azure eng? or, better, to a smaller citus perf on azure eng group?). Today, someone has to manually keep track of whether the test has executed successfully. For example, we once realized failures only two weeks after they happened.
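
A sketch of the notification step; the SMTP host and the recipient alias are placeholders, since the group itself does not exist yet:

```python
# Sketch of a failure notification; SMTP host and recipient are placeholders.
import smtplib
from email.message import EmailMessage


def notify_failure(step, error):
    msg = EmailMessage()
    msg["Subject"] = f"[citus-perf] pipeline failed at step: {step}"
    msg["From"] = "perf-bot@example.com"
    msg["To"] = "citus-perf-on-azure-eng@example.com"
    msg.set_content(str(error))
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)


try:
    run_perf_pipeline()  # entry point from the (1.3) sketch
except Exception as e:
    notify_failure("pgbench run", e)
    raise
```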

2.3  Notify on sharp drops/increases
On sharp drops or increases in performance (say 10-15%) compared to the previous test, send a notification e-mail to the "citus perf on azure eng" group (which doesn't exist yet).
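
Assuming results land in the pgbench_results table from the (1.1) sketch, the check itself is a comparison of the two most recent rows:

```python
# Sketch of the drop/spike check against the hypothetical pgbench_results
# table from the (1.1) sketch; 15% is the threshold suggested above.
import psycopg2


def check_regression(threshold=0.15):
    conn = psycopg2.connect(host="40.80.217.255", user="citus", dbname="postgres")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT tps FROM pgbench_results ORDER BY run_at DESC LIMIT 2")
        rows = cur.fetchall()
    conn.close()
    if len(rows) < 2:
        return  # not enough history to compare
    latest, previous = rows[0][0], rows[1][0]
    change = (latest - previous) / previous
    if abs(change) > threshold:
        notify_failure("tps check", f"tps changed {change:+.1%} vs previous run")
```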

2.4  Create a dashboard to see the results
Create a dashboard that shows the results. Otherwise, we’d need to use psql and analyze the data manually, which is a pretty bad user experience for performance tracking.

Can we have a PowerBI template that shows the performance after just typing in the database credentials?

Or, could we host this dashboard on https://msdata.visualstudio.com? This page should also accommodate the other types of benchmarks that we might implement in (3), so let's build it with some flexibility.

Midterm goals: Increase the test coverage (3-4 weeks)

3.1  Pgbench queries should hit the disk as well
Currently all the pgbench tests run in memory. It'd be great to have some tests where the data does not fit into memory, since that would benchmark a very different angle.
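
Since one pgbench scale-factor unit is roughly 15-16 MB of data, picking a scale factor that overflows RAM is simple arithmetic; a sketch, with the per-unit size and headroom factor as approximations:

```python
# Sketch of choosing a pgbench -s value larger than RAM; one scale unit is
# roughly 15-16 MB, so treat the result as a ballpark, not an exact fit.
def scale_for_memory(ram_gb, headroom=2.0):
    mb_per_scale_unit = 16
    target_mb = ram_gb * 1024 * headroom
    return int(target_mb / mb_per_scale_unit)


# e.g. on a 32 GB worker, initialize with roughly twice that much data:
#   pgbench -i -s 4096 ...
print(scale_for_memory(32))  # 4096
```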

3.2  Use custom pgbench scripts
Today we are always using the pre-defined pgbench tests. We should be able to use custom test scripts to get much more coverage of the various use cases that Citus supports. For example, we do scale tests in release tests, and we might consider something like this: https://github.com/citusdata/test-automation/tree/master/fabfile/pgbench_scripts.

This requires some preparation, since the scale tests are not of sufficient quality as of today. We should simulate more real-world use cases with custom scripts.
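
Mechanically, running a custom script is just pgbench -f; the sketch below writes a made-up multi-tenant transaction (not one of the scripts in fabfile/pgbench_scripts) to a temp file and runs it:

```python
# Sketch of driving pgbench with a custom script via -f. The script body is a
# made-up multi-tenant example; the events table and host are assumptions.
import subprocess
import tempfile

CUSTOM_SCRIPT = r"""
\set tenant random(1, 1000)
BEGIN;
INSERT INTO events (tenant_id, payload) VALUES (:tenant, 'x');
SELECT count(*) FROM events WHERE tenant_id = :tenant;
COMMIT;
"""

with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write(CUSTOM_SCRIPT)
    script_path = f.name

subprocess.run(
    ["pgbench", "-h", "coordinator", "-f", script_path,
     "-c", "32", "-j", "8", "-T", "300", "postgres"],
    check=True,
)
```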

3.3  Use other benchmarks as well
Today we are only running pgbench tests. They are important and useful for tracking performance, but we should also consider other benchmarks like TPC-C or TPC-H. @Michal Primke mentioned that the Sterling PostgreSQL team already has a few benchmarks implemented.

Learn what those are. Could we use them as well? Or could we be inspired by them?

3.4  Change cluster configuration
Today we have a pre-defined cluster (1 coordinator + 4 workers, of whatever instance type). We should be able to run the tests on differently sized clusters as well. For example, run the tests with a small amount of memory.
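
Purely as a shape for the idea (the provisioning helpers below do not exist in the repo under these names), the same pipeline could be parameterized over a list of cluster shapes:

```python
# Hypothetical sketch: provision_cluster()/teardown_cluster() are stand-ins
# for the repo's real provisioning tasks, and the VM sizes are examples only.
CLUSTER_SHAPES = [
    {"workers": 4, "vm_size": "Standard_D8s_v3"},  # default-ish shape
    {"workers": 4, "vm_size": "Standard_D2s_v3"},  # low-memory variant
    {"workers": 8, "vm_size": "Standard_D8s_v3"},  # wider cluster
]


def provision_cluster(workers, vm_size):
    ...  # call into the existing provisioning automation


def teardown_cluster(cluster):
    ...


for shape in CLUSTER_SHAPES:
    cluster = provision_cluster(**shape)
    try:
        run_perf_pipeline()  # entry point from the (1.3) sketch
    finally:
        teardown_cluster(cluster)
```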

Long-term goals: Involve shard moves/splits (1-2 weeks)

Shard moves are an essential part of Citus. Tracking their performance (and, with some minimal effort, their correctness) would be valuable. Today, we say that shard moves are online (e.g., modifications are not blocked). However, we have no data points on the actual performance of Citus during shard moves.

4.1  Rebalance + INSERT
Run a continuous rebalance (e.g., while True: pick a random shard and move it) concurrently with INSERTs. At the end of the test, make sure that all the INSERTs completed successfully. Save the tps and latencies as well. On any failure, send an e-mail. (A sketch of such a harness follows these items.)

4.2  Rebalance + UPDATE
Run a continuous rebalance (e.g., while True: pick a random shard and move it) concurrently with UPDATEs. At the end of the test, make sure that all the UPDATEs completed successfully. Save the tps and latencies as well. On any failure, send an e-mail.
… (and SELECT/DELETE)
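
A minimal harness sketch for the rebalance + INSERT variant, assuming a hypothetical distributed table named items; it uses the citus_shards view and citus_move_shard_placement() from Citus 10+ (older releases expose pg_dist_shard_placement and master_move_shard_placement() instead). The other variants only swap the workload function:

```python
# Sketch of 4.1: move random shards in a loop while INSERTs run, then verify.
# The "items" table is hypothetical; citus_shards/citus_move_shard_placement()
# assume Citus 10+ (older versions use master_move_shard_placement()).
import threading

import psycopg2

COORDINATOR = dict(host="coordinator", user="citus", dbname="postgres")
stop = threading.Event()


def mover():
    conn = psycopg2.connect(**COORDINATOR)
    conn.autocommit = True
    cur = conn.cursor()
    while not stop.is_set():
        # pick a random placement of a random shard of "items"
        cur.execute(
            "SELECT shardid, nodename, nodeport FROM citus_shards "
            "WHERE table_name = 'items'::regclass ORDER BY random() LIMIT 1")
        shardid, src_host, src_port = cur.fetchone()
        # pick a different active worker as the target (groupid 0 = coordinator)
        cur.execute(
            "SELECT nodename, nodeport FROM pg_dist_node "
            "WHERE isactive AND groupid > 0 "
            "AND NOT (nodename = %s AND nodeport = %s) "
            "ORDER BY random() LIMIT 1", (src_host, src_port))
        dst_host, dst_port = cur.fetchone()
        cur.execute("SELECT citus_move_shard_placement(%s, %s, %s, %s, %s)",
                    (shardid, src_host, src_port, dst_host, dst_port))


def inserter(n):
    conn = psycopg2.connect(**COORDINATOR)
    conn.autocommit = True
    cur = conn.cursor()
    for i in range(n):
        cur.execute("INSERT INTO items (id, value) VALUES (%s, %s)", (i, i))


ROWS = 100_000
t = threading.Thread(target=mover)
t.start()
inserter(ROWS)
stop.set()
t.join()

# end of the test: make sure every INSERT is actually there
conn = psycopg2.connect(**COORDINATOR)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM items")
assert cur.fetchone()[0] == ROWS
```
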
SaitTalhaNisanci commented 4 years ago
> For instance, network latency characteristics between two different clusters are almost always very different. So, we should definitely make sure to run the two selected versions of Citus on the same cluster.

With #123, tests are already run on the same cluster, so the network latency is expected to be the same.