SDC-GoldEngine / questions-and-answers

Benchmarking PostgreSQL queries #3

Closed: slargman closed this issue 2 years ago

slargman commented 2 years ago

I anticipate that the database will be the performance bottleneck, since it will be possible to spin up more instances of the q&a server to query the database, but I probably will not create additional database instances since that would introduce the problem of keeping the databases in sync. So optimizing the database queries is currently my focus for ensuring performance. In particular, I'm working on optimizing the primary /products endpoint, as it will require the most complex query.

I've created a simple benchmarking tool that takes different implementations of a query, iterates through them, fires several concurrent queries with each implementation (to simulate multiple clients), repeats this procedure for several trials, and finally reports some basic statistics on each implementation's performance.
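For illustration, the harness looks roughly like the following sketch (TypeScript with node-postgres). The `Implementation` shape, the `runTrial`/`benchmark` names, and the random product_id range are simplified stand-ins for the actual tool, not its real API:

```ts
import { Pool } from 'pg';

// An "implementation" is any async function that answers the /products query
// for a single product_id against the given connection pool.
interface Implementation {
  name: string;
  run: (pool: Pool, productId: number) => Promise<unknown>;
}

// Fire one concurrent request per product id with a single implementation and time the batch.
async function runTrial(pool: Pool, impl: Implementation, productIds: number[]): Promise<number> {
  const start = process.hrtime.bigint();
  await Promise.all(productIds.map((id) => impl.run(pool, id)));
  return Number(process.hrtime.bigint() - start) / 1e9; // seconds
}

// Repeat each implementation for several trials and report mean +/- standard error.
async function benchmark(pool: Pool, impls: Implementation[], clients: number, iterations: number) {
  for (const impl of impls) {
    const times: number[] = [];
    for (let i = 0; i < iterations; i++) {
      // A fresh random sample of product ids per trial (see Limitations below).
      const productIds = Array.from({ length: clients }, () => 1 + Math.floor(Math.random() * 1_000_000));
      times.push(await runTrial(pool, impl, productIds));
    }
    const mean = times.reduce((a, b) => a + b, 0) / times.length;
    const variance = times.reduce((a, t) => a + (t - mean) ** 2, 0) / times.length;
    const stdErr = Math.sqrt(variance / times.length);
    console.log(`${impl.name}: ${mean.toFixed(5)}s +/- ${stdErr.toFixed(5)}`);
  }
}
```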

For the primary endpoint I need to query the questions table using the provided product_id, use the question_ids from that result to fetch the related answers from the answers table, and then use the answer_ids from those answers to fetch the photos from the photos table, completing the request.

Implementations tested

  1. Sequential Questions -> Answers -> Photos with sequential steps: This implementation queries questions using the provided product_id, pulls the question_ids from the result, iterates through the question_ids to query answers, and then iterates through the answers to query photos using their answer_ids.
  2. Sequential Questions -> Answers -> Photos with parallel steps: As above, but once the question_ids are acquired, the queries for the answers related to each question_id are fired in parallel.
  3. Single query: This uses a single query with joins to pull all the required information from the database at once. Currently this is a bit of a toy implementation since it does not yet pull all of the required columns; I will update my schema to avoid conflicting column names in order to make implementing the full query simpler. (Rough sketches of [2] and [3] follow this list.)
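Roughly, [2] and [3] might look like this with node-postgres. The table and column names (questions.product_id, answers.question_id, answers_photos.answer_id, body, url, the reported flags) are assumptions about the schema, the photo lookups in [2] are parallelized the same way as the answer lookups for illustration, and the step that regroups joined rows into the nested API response is omitted:

```ts
import { Pool } from 'pg';

// Implementation 2 (sketch): sequential Questions -> Answers -> Photos, with the
// per-question and per-answer lookups fired in parallel via Promise.all.
async function parallelSteps(pool: Pool, productId: number) {
  const questions = (await pool.query(
    'SELECT * FROM questions WHERE product_id = $1 AND reported = FALSE', [productId]
  )).rows;

  return Promise.all(questions.map(async (q) => {
    const answers = (await pool.query(
      'SELECT * FROM answers WHERE question_id = $1 AND reported = FALSE', [q.question_id]
    )).rows;
    const answersWithPhotos = await Promise.all(answers.map(async (a) => {
      const photos = (await pool.query(
        'SELECT * FROM answers_photos WHERE answer_id = $1', [a.answer_id]
      )).rows;
      return { ...a, photos };
    }));
    return { ...q, answers: answersWithPhotos };
  }));
}

// Implementation 3 (sketch): a single round trip that joins all three tables;
// the flat rows would be regrouped into nested questions/answers/photos in
// application code (omitted here).
async function singleQuery(pool: Pool, productId: number) {
  const { rows } = await pool.query(
    `SELECT q.question_id, q.body AS question_body,
            a.answer_id, a.body AS answer_body,
            p.id AS photo_id, p.url AS photo_url
     FROM questions q
     LEFT JOIN answers a ON a.question_id = q.question_id AND a.reported = FALSE
     LEFT JOIN answers_photos p ON p.answer_id = a.answer_id
     WHERE q.product_id = $1 AND q.reported = FALSE`,
    [productId]
  );
  return rows;
}
```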

Results

| Implementation | Mean time per iteration +/- std err |
| ----------------------------------------------- | ---------------------- |
| Sequential Qs -> As -> Ps with sequential steps | 118.43902s +/- 2.94974 |
| Sequential Qs -> As -> Ps with parallel steps   |  80.58171s +/- 1.57070 |
| Single query                                    |  76.30626s +/- 1.58388 |

iterations: 5   clients: 100   total time: 22.9min

| Implementation | Mean time per iteration +/- std err |
| ----------------------------------------------- | ---------------------- |
| Sequential Qs -> As -> Ps with sequential steps | 637.02769s +/- 0.00000 |
| Sequential Qs -> As -> Ps with parallel steps   | 421.13981s +/- 0.00000 |
| Single query                                    | 426.86207s +/- 0.00000 |

iterations: 1   clients: 500   total time: 24.8min

Limitations

The benchmark only reports aggregates. These give an idea of throughput, but provide no information on latency. Benchmarks with a single client can give some idea of latency, although what we would really like to know is the latency of a request when many requests are being sent to the service at once. This is possible to determine with my benchmarking tool, but would require some added logic.
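That added logic would amount to timing each request individually rather than only the batch as a whole, roughly along these lines (a hypothetical helper, not part of the current tool):

```ts
import { Pool } from 'pg';

// Time each request individually (not just the whole batch), so per-request
// latency percentiles can be reported alongside throughput.
async function runTrialWithLatencies(
  pool: Pool,
  run: (pool: Pool, productId: number) => Promise<unknown>,
  productIds: number[]
): Promise<{ batchSeconds: number; latenciesMs: number[] }> {
  const latenciesMs: number[] = [];
  const start = process.hrtime.bigint();
  await Promise.all(productIds.map(async (id) => {
    const t0 = process.hrtime.bigint();
    await run(pool, id);
    latenciesMs.push(Number(process.hrtime.bigint() - t0) / 1e6);
  }));
  return { batchSeconds: Number(process.hrtime.bigint() - start) / 1e9, latenciesMs };
}

// e.g. report the 95th-percentile latency for a trial:
const p95 = (xs: number[]) => [...xs].sort((a, b) => a - b)[Math.floor(0.95 * (xs.length - 1))];
```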

Currently a random set of productIds is generated for each run of the benchmark suite, which limits comparability between different runs.

Ideally, I would generate a fixed set of random productIds, drawing primarily from the first 10%, middle 10%, and last 10% of rows, and reuse that set for every benchmark run.
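One way to do that would be to sample the three bands once and save the result to a file that every run loads. The band boundaries, sample sizes, and file name below are placeholders, not part of the current tool:

```ts
import { writeFileSync } from 'fs';

// Draw `n` random ids uniformly from the closed range [lo, hi].
const sampleBand = (lo: number, hi: number, n: number): number[] =>
  Array.from({ length: n }, () => lo + Math.floor(Math.random() * (hi - lo + 1)));

// Sample product ids from the first, middle, and last 10% of the id range once,
// then reuse the saved list for every benchmark run.
function buildFixedProductIds(maxProductId: number, perBand: number): number[] {
  const tenth = Math.floor(maxProductId / 10);
  return [
    ...sampleBand(1, tenth, perBand),                                                          // first 10%
    ...sampleBand(Math.floor(maxProductId * 0.45), Math.floor(maxProductId * 0.55), perBand),  // middle 10%
    ...sampleBand(maxProductId - tenth + 1, maxProductId, perBand),                            // last 10%
  ];
}

// Generate once and commit the file; the benchmark then reads it instead of re-sampling.
writeFileSync('product-ids.json', JSON.stringify(buildFixedProductIds(1_000_000, 200)));
```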

Analysis

Initial tests with only one client showed that the sequential implementation with parallel steps was quicker than the single query. However, I worried that with multiple clients it would negatively affect throughput when thousands of requests are being made.

The results from this benchmark may not be statistically significant, and the demonstrated performance difference between [2] and [3] is slight. Nevertheless, my focus going forward will be on optimizing the single-query implementation in order to minimize round trips to the database, since I anticipate this will be a more significant factor when there are >1000 clients.

An earlier benchmark produced markedly different results:

| Implementation | Mean time per iteration +/- std err |
| ----------------------------------------------- | ----------------------- |
| Sequential Qs -> As -> Ps with sequential steps | 124.90153s +/-  6.93308 |
| Sequential Qs -> As -> Ps with parallel steps   |  90.28772s +/-  2.52228 |
| Single query                                    | 106.80656s +/- 20.44773 |

iterations: 10   clients: 100   total time: 53.7min

However, my laptop went to sleep during that benchmark, and given the unusually high standard error for the single query, I believe this occurred while that implementation was being tested, invalidating those results.

Next Steps

slargman commented 2 years ago

The impact of indexing is drastic:

| Implementation | Mean time per iteration +/- std err |
| ----------------------------------------------- | -------------------- |
| Sequential Qs -> As -> Ps with sequential steps | 0.09831s +/- 0.01931 |
| Sequential Qs -> As -> Ps with parallel steps   | 0.09727s +/- 0.00583 |
| Single query                                    | 0.03123s +/- 0.00465 |

iterations: 5   clients: 100   total time: 0.0min

| Implementation | Mean time per iteration +/- std err |
| ----------------------------------------------- | -------------------- |
| Sequential Qs -> As -> Ps with sequential steps | 0.60760s +/- 0.00000 |
| Sequential Qs -> As -> Ps with parallel steps   | 0.44901s +/- 0.00000 |
| Single query                                    | 0.11021s +/- 0.00000 |

iterations: 1   clients: 500   total time: 0.0min

| Implementation | Mean time per iteration +/- std err |
| ----------------------------------------------- | -------------------- |
| Sequential Qs -> As -> Ps with sequential steps | 6.76136s +/- 0.03090 |
| Sequential Qs -> As -> Ps with parallel steps   | 8.95055s +/- 0.07006 |
| Single query                                    | 2.02904s +/- 0.02300 |

iterations: 10   clients: 10000   total time: 3.0min

At this point the question of database performance is conclusively settled. I will be using single queries for all routes and I can proceed to implementing my server.
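For reference, the indexes behind these numbers are presumably along the following lines (a sketch only; the index names and the table/column names match the assumed schema from the query sketch above):

```ts
// Indexes on the lookup columns used by the /products query path.
// Table and column names are assumptions about the schema; run once,
// e.g. via pool.query(createIndexes).
const createIndexes = `
  CREATE INDEX IF NOT EXISTS idx_questions_product_id ON questions (product_id);
  CREATE INDEX IF NOT EXISTS idx_answers_question_id  ON answers (question_id);
  CREATE INDEX IF NOT EXISTS idx_photos_answer_id     ON answers_photos (answer_id);
`;
```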

slargman commented 2 years ago

A quick check of the impact of removing the `WHERE reported = FALSE` clauses seemed to show no significant performance difference:

| Implementation            | Mean time per iteration +/- std err |
| ------------------------- | -------------------- |
| Single query              | 1.37175s +/- 0.04535 |
| Don't filter for reported | 1.36799s +/- 0.02437 |

iterations: 10   clients: 10000   total time: 0.5min

| Implementation            | Mean time per iteration +/- std err |
| ------------------------- | --------------------- |
| Single query              | 12.36841s +/- 0.03614 |
| Don't filter for reported | 12.87218s +/- 0.09524 |

iterations: 10   clients: 50000   total time: 4.2min

| Implementation            | Mean time per iteration +/- std err |
| ------------------------- | --------------------- |
| Single query              | 29.63083s +/- 0.21004 |
| Don't filter for reported | 29.62052s +/- 0.12427 |

iterations: 5   clients: 100000   total time: 4.9min