cockroachdb / cockroach


kv: undesirable behavior as the system load increases past sustainable limits #89142

Open andrewbaptist opened 2 years ago

andrewbaptist commented 2 years ago

Describe the problem

As the system pushes past sustainable limits, lots of bad things happen:

1) Throughput drops to a fraction of its initial level.
2) The LSM gets inverted.
3) Nodes crash with out-of-memory errors.

To Reproduce

Instructions are here: https://gist.github.com/andrewbaptist/27cee6c88ac9e704b5efaf15a37d4a37

What did you do? Describe in your own words.

1) Set up a 6-node cluster.
2) Initialize with 10K warehouses.
3) Begin a run with a ramp.
4) Monitor the system.
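
A minimal command sketch of these steps, assuming roachprod and the built-in tpcc workload; the cluster name, node counts, and ramp/duration values here are illustrative, and the linked gist is authoritative:

```bash
# Create a 7-node cluster: 6 CockroachDB nodes plus 1 node to drive the workload.
export CLUSTER="${USER}-overload"
roachprod create "${CLUSTER}" -n 7

# Stage the cockroach binary and start nodes 1-6.
roachprod stage "${CLUSTER}" cockroach
roachprod start "${CLUSTER}":1-6

# Initialize TPC-C with 10K warehouses from the workload node.
roachprod run "${CLUSTER}":7 -- ./cockroach workload init tpcc \
  --warehouses=10000 {pgurl:1}

# Run with a ramp so offered load climbs gradually past the sustainable rate,
# then watch the DB Console (port 26258 in this issue) while it runs.
roachprod run "${CLUSTER}":7 -- ./cockroach workload run tpcc \
  --warehouses=10000 --ramp=2h --duration=4h {pgurl:1-6}
```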

Expected behavior

TBD what the expected behavior is. Part of the goal of this is to define the expected behavior and compare to current behavior.

Environment:

Jira issue: CRDB-20133

andrewbaptist commented 2 years ago

(Eight DB Console screenshots attached, captured from 34.74.199.84:26258 on 2022-10-01 between 09:35 and 09:37.)

andrewbaptist commented 2 years ago

logs.tgz

andrewbaptist commented 2 years ago

The first part of the run. (Three DB Console screenshots attached, captured 2022-10-01 between 09:41 and 09:44.)

andrewbaptist commented 2 years ago

From the experiments, the system goes through four "distinct" phases.

Phase 1

During phase 1, the system is mostly well-behaved. The number of transactions/sec is increasing linearly and the CPU utilization reaches 50% near the end. Runnable goroutines/CPU stays stable at about 0.4-0.5. The storage layer is very healthy during this period.

During this period the P90 latency stays flat, but P99 latency has a small rise. The latencies going up the stack:

- Command commit latency: stays close to 30 microseconds (both P90 and P99).
- Log commit latency: P99 has a small increase from ~1ms to ~3ms; P50 is stable at 0.2ms.
- KV latency: P99 increases roughly linearly from 5ms to 75ms; P90 is flat at about 1ms.
- Service latency: P99 increases roughly linearly from 20ms to 50ms; P90 stays flat at about 5ms.
- Transaction latency: P99 increases linearly from about 100ms to about 400ms; P90 stays flat at about 100ms.
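
For reference, these latencies can also be scraped straight from a node's Prometheus endpoint rather than read off the Console graphs; a sketch, assuming the usual metric names for command commit, log commit, KV execution, SQL service, and SQL transaction latency (exact names can vary by version), and the HTTP address visible in the screenshots:

```bash
# Pull the raw histograms behind the graphs above.
# Use http:// instead of https:// if the cluster runs in insecure mode.
NODE="34.74.199.84:26258"
curl -sk "https://${NODE}/_status/vars" | grep -E \
  'raft_process_commandcommit_latency|raft_process_logcommit_latency|exec_latency|sql_service_latency|sql_txn_latency'
```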

Phase 2

In phase 2, throughput continues to increase linearly, going from 190 to 380 transactions/sec/node. Admission control begins adding delays to requests, and those delays grow over the period. CPU, goroutines/CPU, and latencies all begin increasing linearly as well. However, system-level queues are not yet building, since all requests are still being processed.

- Command commit latency: stays close to 30 microseconds (both P90 and P99).
- Log commit latency: P99 increases from ~3ms to ~15ms; P50 is stable at 0.2ms.
- KV latency: P99 increases roughly linearly from 5ms to 75ms; P90 increases from 1ms to 10ms.
- Service latency: P99 increases roughly linearly from 50ms to 200ms; P90 increases from 5ms to 50ms.
- Transaction latency: P99 increases linearly from about 0.4s to about 1.2s; P90 increases from 100ms to 400ms.
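
The admission-control behavior in this phase can be watched from the same metrics endpoint; a sketch, assuming the admission granter slot and wait-duration metric names (these names are assumptions and may differ across versions):

```bash
# Used vs. total KV admission slots plus KV admission wait durations; growing waits
# are the "delays being added to requests" described above.
NODE="34.74.199.84:26258"
curl -sk "https://${NODE}/_status/vars" | \
  grep -E '^admission_(granter_(used|total)_slots_kv|wait_durations_kv)'
```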

Phase 3

In phase 3, the system starts close to its peak throughput of 500 transactions/sec/node. The CPU stays at 100% for this entire phase.

Transaction latency is maxed out at 10s per transaction, and queues are building throughout the system. After the peak transactions/sec is reached, additional requests queue up and SQL memory begins growing. Transaction throughput decreases over this period as the system becomes increasingly less efficient, from about 500/sec down to around 300/sec near the end. Service and KV latencies climb during this period as the number of goroutines/core increases to about 20.

At the storage layer, the LSM becomes somewhat less healthy, with read amplification increasing from 5 to a noticeably higher level near the end of the phase. It is not entirely clear why compaction cannot keep up, but it is likely that the limited CPU on the system leaves too little capacity for compactions to run.
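
To track that LSM trend outside the Console, a sketch that samples read amplification and L0 sublevel counts periodically (the metric names are the usual storage ones, but treat them as assumptions):

```bash
# Read amplification climbing while L0 sublevels grow is the signature of compactions
# falling behind, consistent with the CPU-starvation theory above.
NODE="34.74.199.84:26258"
while true; do
  date
  curl -sk "https://${NODE}/_status/vars" | \
    grep -E '^(rocksdb_read_amplification|storage_l0_sublevels)'
  sleep 60
done
```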

Phase 4

In phase 4, the system falls apart, as it is well past a healthy rate of incoming requests. Memory usage becomes critical and the kernel kills the cockroach process on the nodes. Throughput declines to about 50-100/s, and the number of transactions being processed becomes highly unstable. The first node fails at 2:20 and the others fail by 2:45. The LSM also becomes severely unhealthy during this time. With the high number of goroutines running, the system is not even able to correctly monitor clock offsets, and they rise to almost 10ms.
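
Two quick checks for the failure mode described in this phase, assuming roachprod-managed VMs and the clock-offset metric name (both assumptions):

```bash
# Confirm the kernel OOM-killed the cockroach process on each node.
for n in 1 2 3 4 5 6; do
  echo "--- node ${n} ---"
  roachprod run "${CLUSTER}":${n} -- "sudo dmesg -T | grep -i 'out of memory' | tail -n 2"
done

# Clock offset as reported by a node; it climbs toward ~10ms late in this phase.
curl -sk "https://34.74.199.84:26258/_status/vars" | grep '^clock_offset_meannanos'
```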

sumeerbhola commented 2 years ago

Starting at 2:05 the "KV Admission Slots" are climbing, and so is the "Runnable Goroutines per CPU". The two increases may be on different nodes, and it's not clear to me whether for the former it is the used slots that are climbing or the limit. But admission control is no longer able to exercise control, which suggests there is a lot of work bypassing admission control. The raftScheduler goroutines are bounded to 8*GOMAXPROCS, so it cannot be follower work that is responsible for these goroutine counts. I suspect it is one of the things that explicitly bypasses admission control, akin to https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1661288404006489?thread_ts=1661189220.814819&cid=C0KB9Q03D -- a goroutine dump and a CPU profile would tell us.
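
For anyone reproducing this, both artifacts can be pulled from a node's debug endpoints; a sketch, assuming the HTTP address used elsewhere in this issue and the standard Go pprof paths that CockroachDB serves:

```bash
NODE="34.74.199.84:26258"

# Full goroutine dump with stacks and wait reasons (runtime.gopark shows up here).
curl -sk "https://${NODE}/debug/pprof/goroutine?debug=2" -o goroutines.txt

# 30-second CPU profile; inspect with `go tool pprof profile.pb.gz`.
curl -sk "https://${NODE}/debug/pprof/profile?seconds=30" -o profile.pb.gz
```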

andrewbaptist commented 2 years ago

These were taken about 30 minutes before the nodes OOM.

Goroutine dump (screenshot and profile attached): profile (1).pb.gz

CPU profile (screenshot and profile attached): profile.pb.gz

sumeerbhola commented 2 years ago

There are only 22 goroutines not in runtime.gopark (which is a waiting state), so there shouldn't be a high runnable count when this profile was taken.
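
For context, that kind of count can be reproduced from the attached artifacts; a sketch, assuming the pprof-format goroutine profile above and, alternatively, a `debug=2` text dump saved as goroutines.txt (the filename is an assumption):

```bash
# Goroutines whose stacks do NOT pass through runtime.gopark, i.e. not parked/waiting.
go tool pprof -top -ignore='runtime\.gopark' 'profile (1).pb.gz'

# From a debug=2 text dump, tally goroutines by state instead.
grep -oE '^goroutine [0-9]+ \[[^],]+' goroutines.txt | \
  sed 's/.*\[//' | sort | uniq -c | sort -rn
```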

github-actions[bot] commented 7 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!