cockroachdb / cockroach


kv: high CPU load on an idle cluster #122528

Open andrewbaptist opened 7 months ago

andrewbaptist commented 7 months ago

Describe the problem

While investigating an unrelated issue, I noticed unexpectedly high CPU (~40-50%) on an idle cluster that had a large number of ranges. To track down the cause, I created a 12-node cluster with 400K ranges (with RF=3 that is ~1.2M replicas, or ~100K replicas per node). I observed the CPU both before and after creating the ranges; the profiles are attached below, along with a breakdown of where the CPU is being spent.

I noticed several other anomalies on high-range-count systems that should be tracked down separately.

In my reproduction, the idle CPU was only ~25%; I expect it would have been higher if I had changefeeds enabled.

Before: Ranges: 68. Average CPU is covered in the profile breakdowns below.

To Reproduce

Create a cluster:

roachprod create -n 12 $CLUSTER
roachprod put $CLUSTER artifacts/cockroach
roachprod start $CLUSTER

# Wait 15 minutes then collect CPU metrics and analyze CPU profiles

roachprod run $CLUSTER:1 "./cockroach workload init kv --splits 400000 $(roachprod pgurl $CLUSTER:1)"
roachprod run $CLUSTER:1 "./cockroach workload run kv  $(roachprod pgurl $CLUSTER:1-12) --duration 10m --concurrency 100"
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0        2515834         4193.1     23.8      3.3    104.9    302.0   3892.3

# Wait 15 minutes then collect CPU metrics and analyze CPU profiles
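As a rough sketch of the "collect CPU metrics" step: the per-node Prometheus endpoint on the HTTP port (the same port used in the URLs later in this issue) can be scraped directly. The /_status/vars path and the metric name here are assumptions, not something captured from this run.

# Sketch only: scrape normalized CPU from a few nodes' Prometheus endpoints.
# Endpoint path and metric name are assumptions; host/port follow the URLs used below.
for n in 0001 0002 0003; do
  curl -s "http://baptist-test-$n.roachprod.crdb.io:26258/_status/vars" \
    | grep '^sys_cpu_combined_percent_normalized'
done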

Note: The splits took surprisingly long to complete. I had expected 1-2 minutes, but it took ~1 hour, as it only executes ~120 splits/second at the start. I captured a profile for it, but I'm not filing a bug since this is not a "normal" split. About half the time during the splits was spent in pebble compaction; I don't know whether this would happen on a normal system. The splits did seem to slow down to ~100 splits/second by the end.

Note: During the splits, the two hot ranges were range 1 (which didn't split - see #119421) and the rangelog ranges (which did split). I manually split meta1 to try to relieve pressure, which removed meta1 as a hot range and split meta2 into multiple smaller ranges. This didn't change the split speed, since with sequential splits I think we had a moving hot key on meta2.

Note: On the range report, the Range Log kept failing with an internal server error. If I manually hit the URL it attempts to retrieve (http://baptist-test-0001.roachprod.crdb.io:26258/_admin/v1/rangelog/326154?limit=-1), I get a 500 error after ~10 seconds ("An internal server error has occurred. Please check your CockroachDB logs for more details. Error Code: 40001"). If I change the limit to something smaller (100), it takes a long time to run (~2 minutes) but succeeds. I didn't dig into this, but it would be useful to track down as well (and possibly file a separate bug).
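For reference, the smaller-limit request that does succeed looks roughly like this (same endpoint as above, with the limit query parameter lowered from -1 to 100):

# Same rangelog endpoint as above; limit=-1 returns a 500 (error code 40001) after ~10s,
# while limit=100 succeeds but takes ~2 minutes.
curl -s "http://baptist-test-0001.roachprod.crdb.io:26258/_admin/v1/rangelog/326154?limit=100"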

Note: The ops/sec was surprisingly low for this cluster. Normally I would expect ~10K QPS, but this system was getting ~4K, and it varied wildly (from 100 up to 9000 on a per-second basis). The p99 latency was >100ms (and pMax >500ms). Again, I didn't file a bug, but we might want to track this down separately. I believe this was due to pebble compactions, but it was hard to determine when it was happening, since it was only on one node at a time; it seemed to correlate with large disk read IOPS on that node.

Note: I encountered 4 unexpected liveness heartbeat failures during the init/split phase, and a couple more during the fill phase.

Sample URL to take a CPU profile http://baptist-test-0001.roachprod.crdb.io:26258/debug/pprof/ui/cpu/?node=2&seconds=30
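The URL above opens the built-in pprof UI. To keep a raw profile around for later comparison, something like the following should also work; the /debug/pprof/profile path is the standard Go pprof endpoint and is an assumption here (hitting node 2's own HTTP port directly instead of using the node= parameter):

# Assumption: the standard Go pprof endpoint is exposed on each node's HTTP port.
# Fetch a 30s CPU profile from node 2 and show the top consumers locally.
curl -s -o profile.pb.gz \
  "http://baptist-test-0002.roachprod.crdb.io:26258/debug/pprof/profile?seconds=30"
go tool pprof -top profile.pb.gz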

Profile before the splits were done: profile-before.pb.gz

Wait ~15 minutes for the cluster to stabilize, look at CPU usage on a node, and take a 30s CPU profile. Average CPU = 1.5%. Download the profile and look at the top consumers (note these are % of the 1.5%):

runSingleHeartbeat - 10%
raft.processReady - 5%
baseQueue.maybeAdd - 5%
rpc._Heartbeat_Ping_Handler - 5%
Gossip - 1.8%

Profile during the splits: Average CPU = 75% (profile-during-splits.pb.gz). ~50% of time was spent in pebble.compact and pebble.collectTableStats.

Profile while running the workload (for reference): Average CPU = 58% (profile-during-fill.pb.gz).

Profile after the splits were done and the system was idle again: Average CPU = 22% (profile-after.pb.gz).

sideTransport.publish - 34%
server.computeMetricsPeriodically - 18%
reporter.update - 10% (computing critical ranges)
server.gossipStore - 10%
server.writeNodeStatus - 11%
queue.MaybeAdd - 6% (replicaGC queue and replicaQueue are the biggest)

Graphs of CPU for reference are attached; the four phases are before the splits, during the splits, during the workload fill, and idle afterwards.

Expected behavior

An idle cluster should be using a nominal amount of CPU when nothing is happening. This cluster was using XXX%, and most of that work is unnecessary (checking ranges in queues when there is nothing to do and computing statistics). A 100K-replica-per-node cluster is in the range that many of our customers have (although they typically have more cores).

Environment:

Additional context

We have seen clusters using a lot of CPU for non-user activities, and we are trying to make our system faster over time. Addressing some of this low-hanging fruit will improve performance (higher throughput and lower latency) for users.

Jira issue: CRDB-37946

Epic CRDB-40201

andrewbaptist commented 7 months ago

To test reducing CPU, I changed some cluster settings:

set cluster setting kv.closed_timestamp.side_transport_interval = '5s';
set cluster setting kv.replication_reports.interval = 0;
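To double-check that the settings took effect, a quick verification along the lines of the repro commands above should work (a sketch using the standard cockroach sql client flags, not what was actually run here):

# Sketch: read the settings back on node 1 using the same roachprod/pgurl pattern as above.
roachprod run $CLUSTER:1 "./cockroach sql --url $(roachprod pgurl $CLUSTER:1) -e 'SHOW CLUSTER SETTING kv.closed_timestamp.side_transport_interval'"
roachprod run $CLUSTER:1 "./cockroach sql --url $(roachprod pgurl $CLUSTER:1) -e 'SHOW CLUSTER SETTING kv.replication_reports.interval'"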

I couldn't easily change the 10s node metrics interval, which is used for computeMetricsPeriodically and writeNodeStatus, without a rebuild (see #20810). After disabling the two things noted above, the CPU usage dropped to 14%; of that time, 50% is spent computing capacity (in 3 different places).

Gossip wasn't as bad on this cluster, but it is worse on larger clusters (see #51838).

blathers-crl[bot] commented 6 months ago

cc @cockroachdb/replication

ajstorm commented 6 months ago

@andrewbaptist not sure what the priority is on this one. Can you please add a P-* label?