cockroachdb / cockroach


Visible degradation of write throughput in CockroachDB as seen by Go client #37539

Closed sanjimoh closed 5 years ago

sanjimoh commented 5 years ago

Describe the problem I'm running a pod in Kubernetes with a Go client that spawns 32 goroutines, each of which fires a batched (size=4096), write-only workload of multi-row inserts at a secured CockroachDB cluster.

Over time, I can see the write throughput observed by the Go client gradually degrading. Over an hour-long run against an empty table, I saw a peak write throughput of ~25K ops/sec that gradually dropped to ~17K towards the end of the first hour.

I've also re-tested the scenario with 16 goroutines on the client side, but the results were similar. Why is there such a drastic degradation in throughput?

Refer to the below screenshot [ops/sec (cum) column] -

[screenshot: client output showing the ops/sec (cum) column]

To Reproduce

  1. Set up a secured CockroachDB cluster (default configuration: 3 nodes with RF=3) in Kubernetes. The CockroachDB version used is the latest, v19.1.0.
  2. Start the benchmark run from the Go client pod. If requested, the client code can be shared over the support email channel.
  3. The Go client pod logs will show the write throughput degrading over time.

Expected behavior There shouldn't be a drastic change in the write throughput as experienced by the client!

Additional data / screenshots Multi-row insert SQL query executed - INSERT INTO ABCD_TBL (id, person, attributes) VALUES (1, 'person', '{}'), (2, 'person', '{}'), (3, 'person', '{}'), ....

In the actual code, a JSON string with 15 fields is passed in the insert query for the "attributes" column.
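For illustration, here is a minimal sketch of how such a batched multi-row insert might be issued from Go with the standard database/sql package and the lib/pq driver; the connection string, table contents, and helper names are assumptions for the sketch, not the actual client code.

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "strings"

        _ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks the same protocol
    )

    // row mirrors the three columns used in the query above.
    type row struct {
        id         int64
        person     string
        attributes string // JSON payload, e.g. the 15-field object mentioned above
    }

    // buildInsert assembles one multi-row INSERT for a batch of rows, using
    // numbered placeholders so each JSON payload is passed as a bound parameter.
    func buildInsert(rows []row) (string, []interface{}) {
        var sb strings.Builder
        sb.WriteString("INSERT INTO abcd_tbl (id, person, attributes) VALUES ")
        args := make([]interface{}, 0, len(rows)*3)
        for i, r := range rows {
            if i > 0 {
                sb.WriteString(", ")
            }
            fmt.Fprintf(&sb, "($%d, $%d, $%d)", i*3+1, i*3+2, i*3+3)
            args = append(args, r.id, r.person, r.attributes)
        }
        return sb.String(), args
    }

    func main() {
        // The connection string is illustrative only.
        db, err := sql.Open("postgres", "postgresql://user@localhost:26257/defaultdb?sslmode=require")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        batch := []row{{1, "person", "{}"}, {2, "person", "{}"}, {3, "person", "{}"}}
        stmt, args := buildInsert(batch)
        if _, err := db.Exec(stmt, args...); err != nil {
            log.Fatal(err)
        }
    }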

Table schema used is as below -

[screenshot: table schema]

Environment:

ricardocrdb commented 5 years ago

Hey @sanjimoh

This is an interesting issue you are seeing. To start, we would need to get a better idea of a couple of factors at play. First, what type of hardware are the 3 nodes running on? Can you clarify the CPU, memory, and storage type?

Also, if you can provide me with an email address, I can provide a location for you to upload the Go pod code so that we can try to reproduce the same test. If that doesn't work, a location where I can find the files can be sent to my email at ricardo@cockroachlabs.com, if you'd prefer that method.

One more thing we would like to get an idea of: how large, on average, is the JSON payload for these inserts?

Let me know if you have any questions or concerns.

Cheers, Ricardo

sanjimoh commented 5 years ago

As requested, the details have been shared to the provided email.

sanjimoh commented 5 years ago

Hi Ricardo, did you get a chance to check this?

ricardocrdb commented 5 years ago

Hey @sanjimoh

Sorry for the delay; we needed to make a couple of changes to the code you provided in order to get it working in our testing environment. I have set up some instances that are pretty close to what you mentioned you are running on, and at this point I'm just gathering some baseline performance metrics. Please stand by; as soon as I have some numbers to compare, I will share them with you.

Let me know if you have any other questions.

Cheers, Ricardo

ricardocrdb commented 5 years ago

Hey @sanjimoh

So I managed to complete a number of different tests using your benchmark utility (which I will call benchmark from now on) and compared those findings to the tpcc workload. All tests were performed on a 3-node cluster of GCE n1-standard-8 machines with local SSDs, as they most closely matched the worker nodes you mentioned in your email. I also dropped the table abcd_tbl and recreated it fresh before each test. The results I found were pretty interesting, and I would like to know your thoughts.

The first test used benchmark set to 1 concurrent writer with the 4096 batch size. I let that run for approximately an hour and found that in the initial moments of the job it was running at ~18k ops/sec with a mean latency of ~190 ms. This eventually dropped to ~8k ops/sec after an hour, with a mean latency of ~330 ms. I noticed that the CPU was running at around 75% for the duration of the job on a single node in the cluster.

2019/05/29 18:48:47 Using default number of concurrent connections to db: 1 + 1
2019/05/29 18:48:47 Number of concurrent client threads to db: 1
2019/05/29 18:48:52 _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
2019/05/29 18:48:52       5s        0        18022.7        18022.5    159.4    184.5    184.5    184.5
2019/05/29 18:48:57      10s        0        18022.4        18022.5    167.8    184.5    184.5    184.5
2019/05/29 18:49:02      15s        0        19660.6        18568.6    151.0    167.8    167.8    167.8
2019/05/29 18:49:07      20s        0        19660.9        18841.7    142.6    167.8    184.5    184.5
2019/05/29 18:49:12      25s        0        18841.7        18841.7    151.0    176.2    209.7    209.7
2019/05/29 18:49:17      30s        0        18841.2        18841.6    151.0    176.2    192.9    192.9
...
2019/05/29 19:48:52 _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
2019/05/29 19:48:52   1h0m5s        0        10649.6        12248.2    302.0    335.5    335.5    335.5
2019/05/29 19:48:57  1h0m10s        0        11469.1        12247.2    302.0    335.5    352.3    352.3
2019/05/29 19:49:02  1h0m15s        0        10649.6        12244.9    302.0    335.5    352.3    352.3
2019/05/29 19:49:07  1h0m20s        0        10649.5        12242.7    302.0    318.8    318.8    318.8
2019/05/29 19:49:12  1h0m25s        0        10649.4        12240.5    318.8    318.8    335.5    335.5

^C2019/05/29 19:49:14 
_elapsed___errors_________blocks___ops/sec(cum)
2019/05/29 19:49:14  3627.9s        0       44404736        12239.9

2019/05/29 19:49:14 BenchmarkCockroachDBWrites  44404736         81699.9 ns/op

The second test I ran with benchmark was set to 3 concurrent writers with the same 4096 batch size, and I configured an haproxy load balancer in front of the cluster. The initial metrics showed ~35k ops/sec with a mean latency of ~275 ms. CPU usage was also very high, at over 75% on all 3 nodes, as expected given the load balancer. By the end this had dropped to ~14k ops/sec, and the mean latency had risen to over 700 ms.

2019/05/29 20:04:10 Using default number of concurrent connections to db: 3 + 1
2019/05/29 20:04:10 Number of concurrent client threads to db: 3
2019/05/29 20:04:15 _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
2019/05/29 20:04:15       5s        0        37684.6        37684.3    234.9    352.3    402.7    402.7
2019/05/29 20:04:20      10s        0        32767.1        35225.7    285.2    335.5    352.3    352.3
2019/05/29 20:04:25      15s        0        34406.3        34952.6    285.2    369.1    369.1    369.1
2019/05/29 20:04:30      20s        0        34407.0        34816.2    260.0    318.8    335.5    335.5
2019/05/29 20:04:35      25s        0        34406.3        34734.2    285.2    352.3    385.9    385.9
2019/05/29 20:04:40      30s        0        31129.3        34133.4    285.2    352.3    385.9    385.9
...
2019/05/29 21:07:35 _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
2019/05/29 21:07:35  1h3m25s        0        14745.9        17475.5    704.6    805.3    805.3    805.3
2019/05/29 21:07:40  1h3m30s        0        12287.8        17468.7    771.8    805.3    805.3    805.3
2019/05/29 21:07:45  1h3m35s        0        14745.7        17465.2    738.2    805.3    805.3    805.3
2019/05/29 21:07:50  1h3m40s        0        12287.7        17458.4    771.8    805.3    805.3    805.3
^C2019/05/29 21:07:52 
_elapsed___errors_________blocks___ops/sec(cum)
2019/05/29 21:07:52  3821.9s        0       66715648        17456.3

2019/05/29 21:07:52 BenchmarkCockroachDBWrites  66715648         57286.1 ns/op

The final test I ran was the tpcc workload. I ran cockroach workload init tpcc --warehouses=50 to initialize the data, then cockroach workload run tpcc --warehouses=50 --active-warehouses=50 --duration=60m to run the test. The performance was very steady throughout, utilizing a significantly lower percentage of CPU cycles, around 40% on all 3 nodes. The end of the test report is below, showing an overall mean latency of 37 ms, more than 10 times lower than the values seen with benchmark.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 3600.0s        0           3897            1.1     45.6     46.1     54.5     65.0    167.8  delivery
 3600.0s        0          38777           10.8     32.5     31.5     39.8     50.3    503.3  newOrder
 3600.0s        0           3890            1.1      5.8      5.8      7.3      8.4     71.3  orderStatus
 3600.0s        0          38765           10.8     15.5     14.7     19.9     29.4    436.2  payment
 3600.0s        0           3876            1.1     12.9     13.1     16.8     23.1    285.2  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 3600.0s        0          89205           24.8     23.7     19.9     41.9     52.4    503.3  
Audit check 9.2.1.7: SKIP: not enough delivery transactions to be statistically significant
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: SKIP: not enough order status transactions to be statistically significant

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 3600.0s      646.3 100.5%     32.5     31.5     37.7     39.8     50.3    503.3

The tpcc workload is a standard industry benchmark, and it is designed to account for write contention. The main culprit of the degradation may be the way benchmark submits its requests: it simply throws them at the cluster rapid-fire, creating a large backlog that has to be processed, hence the long latencies. Compare that with the tpcc workload, which submits new requests as cycles become available. Let me know your thoughts, and if you have any additional questions.

Cheers, Ricardo

ricardocrdb commented 5 years ago

Hey @sanjimoh

Did you get a chance to take a look at the tests I ran? Let me know if you have any questions.

Cheers, Ricardo

sanjimoh commented 5 years ago

Hi Ricardo,

Sorry for the late response! Some critical work kept me occupied.

Thank you for your feedback!

In the existing design, the Go client spawns a configured number of goroutines; each routine groups the incoming workload into batches of a configured size and then performs a multi-row insert. Once a batch completes successfully, the routine repeats with a new batch.
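For illustration, a rough sketch of that design follows; the channel, the insertRows helper, and the other names here are placeholders assumed for the sketch, not the actual client code. Each writer goroutine drains the shared stream into a fixed-size batch and issues one multi-row insert per batch:

    package client

    import (
        "context"
        "log"
        "sync"
    )

    // insertRows stands in for the multi-row INSERT described above; it is a
    // placeholder, not the actual client code.
    func insertRows(ctx context.Context, batch []string) error { return nil }

    // writer drains rows from the shared channel, groups them into batches of
    // batchSize, issues one multi-row insert per batch, and then repeats.
    func writer(ctx context.Context, rows <-chan string, batchSize int) {
        batch := make([]string, 0, batchSize)
        flush := func() {
            if len(batch) == 0 {
                return
            }
            if err := insertRows(ctx, batch); err != nil {
                log.Printf("insert failed: %v", err)
            }
            batch = batch[:0]
        }
        for r := range rows {
            batch = append(batch, r)
            if len(batch) == batchSize {
                flush()
            }
        }
        flush() // flush any trailing partial batch
    }

    // startWriters launches the configured number of writer goroutines.
    func startWriters(ctx context.Context, rows <-chan string, writers, batchSize int) {
        var wg sync.WaitGroup
        for i := 0; i < writers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                writer(ctx, rows, batchSize)
            }()
        }
        wg.Wait()
    }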

In an OLTP system with a continuous workload, how would you suggest the Go client handle the requests to get the best out of CockroachDB?

Samusername commented 5 years ago

There can also be a performance difference between random IDs and sequential IDs; performance can stay somewhat better when using sequential IDs. Could that be one reason? I have noticed such a decrease in performance with two other DBMSs as well when comparing random vs. sequential IDs.

The following question was posted to another forum/thread, to be discussed there. [ Another, slightly different topic:

CockroachDB nodes: 9, with 3 data replicas.

If there is only one DB table and one data generator (with several threads), performance is e.g. 30,000 objects/s.

If two DB tables and two data generators are used (each with e.g. 6 threads), slightly better performance was observed. (Not tested enough yet.)

What is your expectation: can/should one very big DB table be used, or should we split such a table into smaller DB tables (or use a partitioned table)? From the point of view of achieving better performance. ]

ricardocrdb commented 5 years ago

Hello @sanjimoh ,

At this point, after consulting with another engineer: it may be worth dropping the batch size to something significantly smaller to start with, e.g. 8 or 16, and then raising the number of concurrent writers; that may help overall performance in your benchmarking utility. As more and more writes come in, more range splits occur, and with the larger batch sizes there is a higher chance of having to wait on the slower ranges, which increases transaction latency.

Please play around with lower batch sizes and higher concurrent writer counts, and let me know the results.

In reference to 1 big table vs. 2 tables, @Samusername, it shouldn't really make a difference, and having 1 table would most likely be easier to work with.

Cheers, Ricardo

Samusername commented 5 years ago

Actually, does performance decrease when using sequential IDs? I am trying to use the following:

    // "ctr" is a global counter shared by all (20) goroutines; id is a BIGINT in the DB.
    var (
        ctr  int64 = 0
        muID sync.Mutex // guards ctr
    )

    // Random ID usage can be replaced with sequential IDs, so each goroutine uses:
    muID.Lock()
    for i := 0; i < *batch; i++ { // this loop already existed
        id := ctr
        ctr = ctr + 1
        // id := bw.rand.Int63() // faster to insert with random numbers?
        // ...
    }
    muID.Unlock()

There is one data generator, which uses 20 goroutines ("golang threads").

With the Java data generator, throughput was good with 4 threads but decreased when more threads were used. Sequential IDs were used in the Java generator (UUIDs were also used in the Java data generator). (Go data generator, TODO: also test with fewer threads, sequential IDs, and bigint.)

ricardocrdb commented 5 years ago

Hey @Samusername

Generally, performance is better when using random IDs rather than sequential IDs. We have a section in the documentation explaining how utilizing UUIDs can help spread the workload; it can be found here.
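For context, the pattern that documentation describes amounts to giving the table a UUID primary key with a random default, so that consecutive inserts land on different ranges rather than piling onto the tail of a sequential index. A minimal sketch, with an assumed table name rather than the schema from this issue:

    package client

    import (
        "database/sql"
    )

    // createTable issues DDL for a table keyed on a randomly generated UUID,
    // the pattern the docs describe for spreading writes across ranges. The
    // table and column names here are illustrative, not this issue's schema.
    func createTable(db *sql.DB) error {
        _, err := db.Exec(`
            CREATE TABLE IF NOT EXISTS events (
                id         UUID DEFAULT gen_random_uuid() PRIMARY KEY,
                person     STRING,
                attributes JSONB
            )`)
        return err
    }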

@sanjimoh Has there been any progress on the tests with smaller batch sizes and larger concurrent writers?

Cheers, Ricardo

sanjimoh commented 5 years ago

Hi Ricardo,

Here are the results -

In the first set of results, I kept the number of concurrent writers fixed at an optimal value of 16 and then slowly increased the batch size.

(I know you haven't really asked for this, but I couldn't resist my curiosity to see it, so please excuse me! I'll publish the results of the other test subsequently; it still needs to be executed.)

Short runs of 15 mins

[screenshot: throughput results]

In this scenario, the CPU and memory usage of the CockroachDB pods was as follows -

- Batch size = 8

NAME CPU(cores) MEMORY(bytes)
v11806-cockroachdb-cockroachdb-0 3170m 9495Mi
v11806-cockroachdb-cockroachdb-1 3319m 9194Mi
v11806-cockroachdb-cockroachdb-2 2949m 7704Mi

- Batch size = 16

NAME CPU(cores) MEMORY(bytes)
v11806-cockroachdb-cockroachdb-0 3072m 9778Mi
v11806-cockroachdb-cockroachdb-1 3012m 9456Mi
v11806-cockroachdb-cockroachdb-2 3952m 7903Mi

- Batch size = 32

NAME CPU(cores) MEMORY(bytes)
v11806-cockroachdb-cockroachdb-0 4303m 10204Mi
v11806-cockroachdb-cockroachdb-1 4274m 9464Mi
v11806-cockroachdb-cockroachdb-2 4156m 8168Mi

sanjimoh commented 5 years ago

In the second set of results, I kept the batch size fixed at 16 and then slowly increased the number of concurrent writers.

Short runs of 15 mins

[screenshot: throughput results]

In this scenario, the CPU and memory usage of the CockroachDB pods was as follows -

- Concurrent Writers = 8

NAME CPU(cores) MEMORY(bytes)
v11806-cockroachdb-cockroachdb-0 2543m 9609Mi
v11806-cockroachdb-cockroachdb-1 2551m 9387Mi
v11806-cockroachdb-cockroachdb-2 2444m 8564Mi

- Concurrent Writers = 16

NAME CPU(cores) MEMORY(bytes)
v11806-cockroachdb-cockroachdb-0 3072m 9778Mi
v11806-cockroachdb-cockroachdb-1 3012m 9456Mi
v11806-cockroachdb-cockroachdb-2 3952m 7903Mi

- Concurrent Writers = 32

NAME CPU(cores) MEMORY(bytes)
v11806-cockroachdb-cockroachdb-0 4227m 9698Mi
v11806-cockroachdb-cockroachdb-1 4961m 9569Mi
v11806-cockroachdb-cockroachdb-2 4617m 8770Mi

sanjimoh commented 5 years ago

The following run is similar to the second test set shared in my previous post; however, this time I ran it for a longer duration (4 hours, to be precise) compared to my earlier short runs (15 mins).

[screenshot: throughput results]

Overall, in all of the test scenarios above, I see the same consistent degradation of throughput as in the initial tests for which I raised this ticket.

Samusername commented 5 years ago

On the sequential IDs topic: sequential IDs with 4 DB schemas, one DB table per schema (instead of one huge DB table), and one generator inserting into its own DB table (Go generator, 6 threads). (In the following result case, the JSON data values in the key/value pairs were not randomly generated but were incremented by one for each row in the DB.) Performance was very good, and the summed throughput was far better. (Retesting is still needed.)

Performance remained stable over a 30-minute execution.

ricardocrdb commented 5 years ago

Hey @sanjimoh

Apologies for the delay in response. What happens when you set the batch size to something very low, like 2, and give it a very high number of concurrent writers?

Cheers, Ricardo

ricardocrdb commented 5 years ago

Hey @sanjimoh

Do you have any updates on the latest batch size testing?

Cheers, Ricardo

sanjimoh commented 5 years ago

Hi Ricardo,

Below are the results for the requested scenario. Short runs for 15 mins.

[screenshots: throughput results]

ricardocrdb commented 5 years ago

Hello @sanjimoh

Judging from your initial results, we are seeing the same behavior: you get more consistent, lower latencies with a lower batch size and a higher number of concurrent writers than in the early tests with a much larger batch size and fewer concurrent writers. Have you tried letting these lower-batch-size tests run for multiple hours, like the tests we first attempted earlier in the lifetime of this issue? With that information, we can see where the bottleneck in your benchmarking utility is and work around it accordingly.

Let me know how it goes, and if you have any other questions.

Cheers, Ricardo

sanjimoh commented 5 years ago

Hi Ricardo,

I can try the longer runs, but I'm wondering: is it the case that, from the Go client, it is currently only possible to do 2 multi-row inserts per transaction?

With parallel concurrent writers, I can scale roughly in proportion to the number of CPU cores in the runtime. I can of course scale out, but a batch size of 2 per transaction looks quite restrictive to me 😔
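For illustration only, a tiny sketch of scaling the writer count with the cores the runtime reports, while keeping the per-transaction batch small; the multiplier is an assumption, not a recommendation from this thread:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // Scale the number of concurrent writers with the cores available to
        // the runtime while keeping each transaction's batch small. The
        // multiplier is an arbitrary illustration.
        writers := runtime.NumCPU() * 4
        batchSize := 2
        fmt.Printf("writers=%d batchSize=%d\n", writers, batchSize)
    }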

Or is there anything I can do to improve my client code?

ricardocrdb commented 5 years ago

Hello @sanjimoh

I think the main culprit of the degradation may be the way benchmark submits its requests: it simply throws them at the cluster rapid-fire, creating a large backlog that needs to be processed, hence the long latencies. A more standardized benchmark submits new requests as others complete, so it runs in a more controlled fashion and keeps latency down.
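For illustration, one way a client might approximate that completion-driven behavior is to cap the number of in-flight batches so new work is only submitted as earlier batches finish. A minimal sketch, where insertBatch and the cap value are assumptions rather than the reported client code:

    package client

    import (
        "context"
        "log"
        "sync"
    )

    // insertBatch stands in for whatever issues one multi-row INSERT; it is a
    // placeholder, not the code from the reported client.
    func insertBatch(ctx context.Context, batch []int64) error {
        // ... build and execute the INSERT for this batch ...
        return nil
    }

    // submit starts a new batch only once an earlier one has completed and
    // freed a slot, so the number of outstanding requests stays bounded
    // instead of piling up into a backlog.
    func submit(ctx context.Context, batches <-chan []int64) {
        const maxInFlight = 16 // illustrative cap, not a tuned value
        sem := make(chan struct{}, maxInFlight)
        var wg sync.WaitGroup
        for b := range batches {
            sem <- struct{}{} // blocks while maxInFlight batches are outstanding
            wg.Add(1)
            go func(b []int64) {
                defer wg.Done()
                defer func() { <-sem }()
                if err := insertBatch(ctx, b); err != nil {
                    log.Printf("batch failed: %v", err)
                }
            }(b)
        }
        wg.Wait()
    }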

Let me know if you have any other questions.

ricardocrdb commented 5 years ago

Hey @sanjimoh

Do you have any other questions on this issue, or shall I go ahead and close it?

ricardocrdb commented 5 years ago

Closing due to inactivity. If you are still having the issue, please feel free to respond to this thread. We want to help!