cockroachdb / cockroach


kv: latency spike after tpcc workload ramps up #115431

Open ajstorm opened 10 months ago

ajstorm commented 10 months ago

An issue was observed on the CCT cluster running the tpcc_drop workload. The workload loops through the following steps (a rough sketch follows the list):

  1. Renames any old tpcc database if one exists
  2. Loads a new tpcc workload with 3000 warehouses of data
  3. Drops the renamed database
  4. Runs tpcc for 60 minutes
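
For concreteness, here's a rough sketch of what one loop iteration amounts to. The `rp`/roachprod invocation style matches the repro scripts further down the thread, but the cluster, node, and database names are purely illustrative and this is not the actual CCT job definition:

```
# Hypothetical sketch of one tpcc_drop iteration (names are illustrative).
export cluster=cct-tpcc-drop

# 1. Rename any old tpcc database out of the way.
rp sql $cluster:1 -- -e "ALTER DATABASE cct_tpcc RENAME TO cct_tpcc_old"

# 2. Load a fresh 3000-warehouse dataset; this is where the IMPORT statements run.
rp run $cluster:1 -- './cockroach workload fixtures import tpcc --warehouses=3000 {pgurl:1}'

# 3. Drop the renamed copy.
rp sql $cluster:1 -- -e "DROP DATABASE IF EXISTS cct_tpcc_old CASCADE"

# 4. Run tpcc for an hour.
rp run $cluster:1 -- './cockroach workload run tpcc --warehouses=3000 --duration=60m {pgurl:1}'
```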

We've found that when the cluster was running 23.1, we'd see one latency spike every hour, while the IMPORT statements from step 2 ran. After finalizing to 23.2, we're observing a second latency spike (as well as a throughput drop) approximately 5 minutes after the IMPORT and drop are complete and the tpcc workload has started ramping up.

23.1 (latency spikes confined to IMPORT window)

image

23.2 finalized (latency spikes occur after IMPORT is complete and workload is ramping up):

image

This is being discussed further here.

Jira issue: CRDB-34031

Epic CRDB-40201

kvoli commented 10 months ago

Looking at one of these instances between 16:30-17:00: as the workload begins scaling up, one of the nodes becomes more overloaded than the others.

Details

![image](https://github.com/cockroachdb/cockroach/assets/39606633/c08698bf-aba3-4d6f-9c11-bae476ca6c58) ![image](https://github.com/cockroachdb/cockroach/assets/39606633/77f94623-3b4b-4d65-9ea8-6c336d32800e)

This gradually shows up in the rebalancing metric `rebalancing.cpunanospersecond`:

Details

![image](https://github.com/cockroachdb/cockroach/assets/39606633/00daa673-c79d-477f-bdf8-d59de0e0b619)

There are also a few load-based splits on the same overloaded nodes, against the recently imported tables:

Details

![image](https://github.com/cockroachdb/cockroach/assets/39606633/1f07700d-fc79-42fc-96d4-ae3808f4bbdc)

```
I231205 16:45:12.459996 2660402565 kv/kvserver/replica_command.go:439 ⋮ [T1,Vsystem,n15,split,s15,r1086074/1:‹/Tenant/3/Table/5792{3-4}›] 3838895 initiating a split of this range at key /Tenant/3/Table/57923/1/‹487› [r1087889] (‹load at key /Tenant/3/Table/57923/1/487 (cpu 1.4s, 205.93 batches/sec, 151.18 raft mutations/sec)›)
I231205 16:45:50.524596 2660723244 kv/kvserver/replica_command.go:439 ⋮ [T1,Vsystem,n15,split,s15,r1086075/1:‹/Tenant/3/Table/5792{4-5}›] 3838947 initiating a split of this range at key /Tenant/3/Table/57924/1/‹497›/‹1› [r1087890] (‹load at key /Tenant/3/Table/57924/1/497/1 (cpu 708ms, 369.36 batches/sec, 305.00 raft mutations/sec)›)
I231205 16:46:59.538644 2661298309 kv/kvserver/replica_command.go:439 ⋮ [T1,Vsystem,n15,split,s15,r1086074/1:‹/Tenant/3/Table/57923{-/1…}›] 3839171 initiating a split of this range at key /Tenant/3/Table/57923/1/‹258› [r1087891] (‹load at key /Tenant/3/Table/57923/1/258 (cpu 522ms, 248.53 batches/sec, 110.54 raft mutations/sec)›)

root@localhost:26257/application/defaultdb> select * from crdb_internal.tables where table_id=57923;
  table_id | parent_id |   name    | database_name | version |          mod_time          |        mod_time_logical        |      format_version      | state  | sc_lease_node_id | sc_lease_expiration_time | drop_time | audit_mode | schema_name | parent_schema_id | locality
-----------+-----------+-----------+---------------+---------+----------------------------+--------------------------------+--------------------------+--------+------------------+--------------------------+-----------+------------+-------------+------------------+-----------
     57923 |     57921 | warehouse | cct_tpcc_drop |       6 | 2023-12-05 16:39:38.157682 | 1701794378157682423.0000000000 | InterleavedFormatVersion | PUBLIC |             NULL |                     NULL |      NULL |   DISABLED |      public |            57922 |     NULL
(1 row)

Time: 343ms total (execution 315ms / network 28ms)

root@localhost:26257/application/defaultdb> select * from crdb_internal.tables where table_id=57924;
  table_id | parent_id |   name   | database_name | version |          mod_time          |        mod_time_logical        |      format_version      | state  | sc_lease_node_id | sc_lease_expiration_time | drop_time | audit_mode | schema_name | parent_schema_id | locality
-----------+-----------+----------+---------------+---------+----------------------------+--------------------------------+--------------------------+--------+------------------+--------------------------+-----------+------------+-------------+------------------+-----------
     57924 |     57921 | district | cct_tpcc_drop |       7 | 2023-12-05 16:39:35.618323 | 1701794375618323107.0000000000 | InterleavedFormatVersion | PUBLIC |             NULL |                     NULL |      NULL |   DISABLED |      public |            57922 |     NULL
(1 row)

Time: 112ms total (execution 110ms / network 2ms)
```
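
For reference, the load-based splitting seen here is governed by a couple of cluster settings, and the split decisions also land in the range log. The setting names below are what I recall for 23.1/23.2; verify them with `SHOW ALL CLUSTER SETTINGS` before relying on them:

```
# Assumed setting names -- double-check with SHOW ALL CLUSTER SETTINGS.
cockroach sql --host=<node> -e "SHOW CLUSTER SETTING kv.range_split.by_load_enabled"
cockroach sql --host=<node> -e "SHOW CLUSTER SETTING kv.range_split.load_cpu_threshold"

# Recent split events (including the load-based ones logged above) are recorded in
# system.rangelog.
cockroach sql --host=<node> -e "SELECT \"timestamp\", \"rangeID\", info FROM system.rangelog WHERE \"eventType\" = 'split' ORDER BY \"timestamp\" DESC LIMIT 10"
```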

The load imbalance is acted on several minutes later by shedding leases from the overloaded node:

Details

```
I231205 16:53:11.694771 5699 13@kv/kvserver/store_rebalancer.go:467 ⋮ [T1,Vsystem,n15,s15,store-rebalancer,obj=‹cpu›] 622301 considering load-based lease transfers for s15 with ‹(queries-per-second=5136.2 cpu-per-second=2.2s)› load, mean=‹(queries-per-second=2249.6 cpu-per-second=1.9s)›, upperThreshold=‹(queries-per-second=2249.6 cpu-per-second=2.1s)›
I231205 16:53:11.702036 5699 13@kv/kvserver/store_rebalancer.go:803 ⋮ [T1,Vsystem,n15,s15,store-rebalancer,obj=‹cpu›] 622302 transferring lease for r18960 load=‹(queries-per-second=433.3 cpu-per-second=112ms)› to store s9 load=‹(queries-per-second=1895.1 cpu-per-second=1.8s)› from local store s15 load=‹(queries-per-second=5136.2 cpu-per-second=2.2s)›
I231205 16:53:11.706135 5699 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n15,s15,store-rebalancer,obj=‹cpu›] 622303 transferring lease to s9
I231205 16:53:11.765387 5699 13@kv/kvserver/store_rebalancer.go:580 ⋮ [T1,Vsystem,n15,s15,store-rebalancer,obj=‹cpu›] 622304 load-based lease transfers successfully brought s15 down to ‹(queries-per-second=4703.0 cpu-per-second=2.1s)› load, mean=‹(queries-per-second=2249.6 cpu-per-second=1.9s)›, upperThreshold=‹(queries-per-second=2249.6 cpu-per-second=2.1s)›)
```
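
For completeness: the store rebalancer behaviour above is controlled by the allocator settings, and the lease pile-up can be eyeballed directly from SQL. The setting names and the `WITH DETAILS` form of `SHOW RANGES` are what I recall for 23.1+, so treat this as a sketch and adjust for your version:

```
# Assumed setting names -- verify with SHOW ALL CLUSTER SETTINGS.
cockroach sql --host=<node> -e "SHOW CLUSTER SETTING kv.allocator.load_based_rebalancing"
cockroach sql --host=<node> -e "SHOW CLUSTER SETTING kv.allocator.load_based_rebalancing.objective"

# Leaseholder distribution for the hot imported table (table name per the CCT job).
cockroach sql --host=<node> -e "
  SELECT lease_holder, count(*)
  FROM [SHOW RANGES FROM TABLE cct_tpcc_drop.warehouse WITH DETAILS]
  GROUP BY 1 ORDER BY 2 DESC"
```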

This points to poor pre-distribution of the imported keyspace: (1) ranges require load-based splits once the workload ramps up, when ideally the selected pre-split points would make load-based splits unnecessary, and (2) multiple such ranges are all leased on the same node.
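
As a point of comparison (not a proposed fix), the distribution we'd ideally get out of the import can be approximated by hand with pre-splits and a scatter immediately after the IMPORT finishes. The table names and split points below are illustrative:

```
# Hypothetical manual mitigation: pre-split the hot tables on warehouse ID and
# scatter the resulting ranges/leases before tpcc ramps up.
cockroach sql --host=<node> -e "
  ALTER TABLE cct_tpcc_drop.warehouse SPLIT AT SELECT generate_series(100, 2900, 100);
  ALTER TABLE cct_tpcc_drop.warehouse SCATTER;
  ALTER TABLE cct_tpcc_drop.district SPLIT AT SELECT generate_series(100, 2900, 100);
  ALTER TABLE cct_tpcc_drop.district SCATTER;"
```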

kvoli commented 10 months ago

Load distribution aside, I'm noticing that the number of `querytxn.recv` requests correlates pretty well with the node that has high execution latency (usually also the node with the higher CPU %).

image

This also shows up prominently in CPU profiles on affected nodes e.g. on n15 analyzed above:

cpuprof.2023-12-05T16_45_07.975.93.pprof

image
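
Since QueryTxn traffic comes from txn pushes, i.e. contention, one way to confirm where that contention sits during the ramp-up window is the contention virtual tables. The view names below are from memory and the column sets vary by version, so check `SHOW TABLES FROM crdb_internal` first:

```
# Assumed crdb_internal view names -- treat as a sketch.
cockroach sql --host=<node> -e "
  SELECT * FROM crdb_internal.cluster_contended_tables LIMIT 10"
cockroach sql --host=<node> -e "
  SELECT * FROM crdb_internal.transaction_contention_events LIMIT 10"
```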

kvoli commented 10 months ago

I didn't notice a difference in behavior between 23.1.12 and 23.2.0-beta.1 when trying to repro with:

v23.1.12

```
export cluster=austen-tpcc-import-spike-231
rp create $cluster -n 8 --gce-machine-type=n2-standard-8 --local-ssd=false
rp stage $cluster release v23.1.12
rp start $cluster:1-7
rp run $cluster:8 -- './cockroach workload run kv --drop --max-rate=5000 --read-percent=50 --splits=2000 {pgurl:1-7}'
rp sql $cluster:1 -- -e 'drop database tpcc'
sleep 600
rp run $cluster:8 -- './cockroach workload init tpcc --warehouses=1500 {pgurl:1-7}'
sleep 60
rp run $cluster:8 -- './cockroach workload run tpcc --warehouses=1500 --duration=60m {pgurl:1-7}'
```

v23.2.0-beta.1

```
export cluster=austen-tpcc-import-spike-232
rp create $cluster -n 8 --gce-machine-type=n2-standard-8 --local-ssd=false
rp stage $cluster release v23.2.0-beta.1
rp start $cluster:1-7
rp run $cluster:8 -- './cockroach workload run kv --drop --max-rate=5000 --read-percent=50 --splits=2000 {pgurl:1-7}'
rp sql $cluster:1 -- -e 'drop database tpcc'
sleep 600
rp run $cluster:8 -- './cockroach workload init tpcc --warehouses=1500 {pgurl:1-7}'
sleep 60
rp run $cluster:8 -- './cockroach workload run tpcc --warehouses=1500 --duration=60m {pgurl:1-7}'
```

image

Grafana (internal) link.

The only subtle difference is that the 23.2 import creates fewer splits; however, the load distribution doesn't materially differ.
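
(For anyone re-running this: the split-count comparison can be pulled straight from SQL right after `workload init tpcc` completes on each version, rather than from the console; database name per the repro above.)

```
# Range count for the imported database on each version.
cockroach sql --host=<node> -e "SELECT count(*) FROM [SHOW RANGES FROM DATABASE tpcc]"
```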

The spike of QueryTxns from pushes at the start of the TPCC run is interesting. I suspect this isn't something new in 23.2; rather, the CCT test cluster was running hotter by the time the 23.2 upgrade finalized.

kvoli commented 10 months ago

I chatted w/ @nvanbenschoten. The txn records for some of the running queries will be written into the district/warehouse tables more commonly than others. The import's pre-distribution couldn't know this a priori, so the higher load on these few tables under contention is not unexpected.

The difference pre- vs. post-finalization to 23.2 is most likely an artifact of generally higher load and a feedback loop: txns get pushed, which in turn causes higher load, and so on.
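
To make the district/warehouse point concrete: in TPC-C, every new_order for a given warehouse and district bumps the same per-district counter, and every payment for a warehouse bumps the same warehouse total, so those few rows (and the ranges holding them) attract a disproportionate share of txn records and pushes. An illustration only, using the column names from the standard tpcc schema; this is not the workload's exact statement order:

```
# Illustration of the hot rows (not the workload's implementation).
cockroach sql --host=<node> -e "
  UPDATE cct_tpcc_drop.district  SET d_next_o_id = d_next_o_id + 1 WHERE d_w_id = 1 AND d_id = 1;
  UPDATE cct_tpcc_drop.warehouse SET w_ytd = w_ytd + 10.0 WHERE w_id = 1;"
```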

The reproduction above does not demonstrate a regression between 23.1 and 23.2. I'm going to remove the GA-blocker label.

ajstorm commented 10 months ago

Thanks for investigating @kvoli, and for ruling out the GA blocker status. Is the higher load when moving to 23.2 expected? If so, what is the cause?

kvoli commented 9 months ago

> Thanks for investigating @kvoli, and for ruling out the GA blocker status. Is the higher load when moving to 23.2 expected? If so, what is the cause?

The higher load after moving to 23.2 is not expected, but it's also not evident that moving to 23.2 caused the higher load. We would need to reproduce higher load on 23.2 than on 23.1 to figure out the cause. The reproduction attempt above doesn't show higher load, but it also didn't run for a long period like the CCT cluster.

ajstorm commented 9 months ago

I may have a theory as to what could have caused the higher load:

  1. For a while, the schemachange workload that was running on the cluster had died. We restarted it shortly after finalization, which could partially account for the increased load.
  2. We just uncovered a bug that prevents ranges from being GCed. This caused the number of ranges on the cluster to grow to almost 400K. It's possible that this caused any O(num ranges) background work to run hotter. This is being tracked in https://github.com/cockroachdb/cockroach/issues/116240.

We should probably hold off on continuing the investigation here until we can rule out these two factors as causes.
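
For the second factor in particular, the range growth is easy to track over time while https://github.com/cockroachdb/cockroach/issues/116240 is being addressed, e.g. with a periodic count:

```
# Track whether the un-GC'ed range count is shrinking back toward normal.
cockroach sql --host=<node> -e "SELECT count(*) FROM crdb_internal.ranges_no_leases"
```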

ajstorm commented 8 months ago

I think we can safely rule out the two theories above. For the first one, we now have a dashboard that tracks schema change operations, and they don't seem correlated with the dip in the TPC-C workload.

For the second one, the cluster has been running for a while at > 20K ranges, and we're still seeing the issue.

I've passed the issue off to @tusharCRL for now, but he may want to work with @kvoli to continue the investigation.

rafiss commented 5 months ago

Could this be related to the observations seen in https://github.com/cockroachdb/cockroach/issues/119924?