irfansharif opened this issue 7 years ago
~ env COCKROACH_DEDICATED_RAFT_STORAGE=DISABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10" > perf-disabled
~ env COCKROACH_DEDICATED_RAFT_STORAGE=ENABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10" > perf-enabled
~ benchstat perf-enabled perf-disabled
name old time/op new time/op delta
ReplicaRaftStorage/vs=1024-4 1.70ms ± 4% 0.44ms ±12% -73.90% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=4096-4 1.91ms ± 2% 0.63ms ±11% -66.95% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=16384-4 2.74ms ± 2% 1.46ms ± 4% -46.68% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=65536-4 5.16ms ± 5% 4.38ms ± 3% -15.10% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=262144-4 13.6ms ± 4% 16.4ms ± 3% +20.51% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=1048576-4 50.9ms ± 3% 71.0ms ± 4% +39.70% (p=0.000 n=10+8)
Early, untuned & possibly incorrect benchmarks. Will update comment once verified.
Any insight as to why performance gets worse with a dedicated Raft storage engine at larger value sizes?
Not yet, no. It shouldn't happen as far as I can tell; it doesn't align with the early experiments done in #16361 either. Investigating.
There was a particularly ill-suited reflect.DeepEqual debug assertion here skewing things. That aside we have:
~ benchstat perf-disabled perf-enabled
name old time/op new time/op delta
ReplicaRaftStorage/vs=1024-4 320µs ± 6% 385µs ±18% +20.29% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=4096-4 613µs ±14% 580µs ± 2% ~ (p=0.278 n=10+9)
ReplicaRaftStorage/vs=16384-4 2.59ms ± 3% 2.05ms ± 4% -20.87% (p=0.000 n=10+9)
ReplicaRaftStorage/vs=65536-4 4.11ms ± 7% 3.29ms ± 3% -19.97% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=262144-4 13.4ms ± 8% 10.7ms ± 3% -20.39% (p=0.000 n=10+10)
ReplicaRaftStorage/vs=1048576-4 56.8ms ± 3% 36.4ms ± 2% -35.91% (p=0.000 n=10+10)
This one? https://github.com/cockroachdb/cockroach/issues/16591
-- Tobias
Nope, I had sprinkled them all over when implementing this to guarantee parity across the two instances. I just forgot to remove all of them when posting the first benchmark.
Have you experimented with disabling syncing when applying Raft log truncations? That would be the ideal scenario with regards to syncing of the non-Raft storage engine.
The following are some results from experiments to determine the relative lack of speed ups from #16809 (pardon the verbosity, it's primarily self-documentation). I ran two long-running experiments on navy, ycsb -workload F (pure writes) with the same initial-load. From 7/13 04:00 (grafana time) to 7/13 21:11 for the first run (with new changes), second from 7/13 21:27 to 7/14 14:39.
First, the results:
--- Before
17h21m56s 1145.0 0 / 0 / 0 1145 / 0 0 / 0
elapsed__ops/sec(total)__errors(total)
62516.6s 919.5 0
--- After
17h49m5s 1099.0 0 / 0 / 0 1099 / 0 0 / 0
elapsed__ops/sec(total)__errors(total)
64145.9s 896.2 0
Overall a 2.59% decrease in throughput. I did the same for a shorter run:
--- Before
1h12m25s 1084.9 0 / 0 / 0 1085 / 0 0 / 0
elapsed__ops/sec(total)__errors(total)
4345.3s 900.9 0
--- After
1h18m40s 1060.0 0 / 0 / 0 1060 / 0 0 / 0
elapsed__ops/sec(total)__errors(total)
4720.8s 877.0 0
Similarly, 2.62% decrease in throughput.
Here are some interesting graphs from the longer run; the first half is navy with the dedicated raft engine changes and the second without:
Overall the change, as is, seems to increase Raft command commit latency. Picking navy-006 to remove the outlier:
We have more CGo calls than we did before (this is expected: we're separately committing to a whole other engine without any form of batching across the two):
We have slightly longer raft log processing times:
I tried using go-torch afterwards; here are some flame graphs operating on the same data:
Before:
After:
They both mostly have the same structure (good, expected) except that after, we have a much wider second flame (I don't know what else to call it). Looking into this further using callgrind (screenshots not shown), I was able to confirm a larger proportion of time spent in CGo land, happening right about here:
. . 4540: isLogTruncationRequest := rResult.RaftLogDelta != nil
. 3.62s 4541: if err := batch.Commit(isLogTruncationRequest); err != nil {
. . 4542: return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
. . 4543: errors.Wrap(err, "could not commit batch")))
. . 4544: }
. 700ms 4545: if err := raftBatch.Commit(isLogTruncationRequest); err != nil {
. . 4546: return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
. . 4547: errors.Wrap(err, "could not commit raft batch")))
. . 4548: }
10ms 10ms 4549: elapsed := timeutil.Since(start)
. 30ms 4550: r.store.metrics.RaftCommandCommitLatency.RecordValue(elapsed.Nanoseconds())
. 10ms 4551: return rResult.Delta, nil
. . 4552:}
. . 4553:
I tried disabling raft log syncing (for both engines, before and after these changes, to get a fair comparison) but still saw similar results. What I think is happening is down to our current batching process for RocksDB batch commits: batches are specific to individual engines, and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This two-phased batching process might explain the drop in throughput. Compare this to a better batching strategy: batching all writes going down to RocksDB, regardless of the engine addressed. Looking into what else can be shared across the two instances: by default the pool of compaction threads is, but there is probably more (see include/rocksdb/env.h).
Some other asides:
+cc @tschottdorf.
I tried disabling raft log syncing (for both engines, before and after these changes, to get a fair comparison) but still saw similar results. What I think is happening is down to our current batching process for RocksDB batch commits: batches are specific to individual engines, and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This two-phased batching process might explain the drop in throughput. Compare this to a better batching strategy: batching all writes going down to RocksDB, regardless of the engine addressed.
Yeah, that does seem problematic. When adding the dedicated syncing goroutine, I considered adding a dedicated goroutine to perform the commits. With some restructuring, I think we could have a dedicated per-engine goroutine for committing batches, queue both batches and then wait for both of the commits to happen.
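A minimal sketch of the idea above, with hypothetical names (this is not the actual CockroachDB implementation): one committer goroutine per engine drains a channel of pending batches, and a caller enqueues a batch on each engine and then waits for both results, so the two commits can overlap instead of running strictly back to back.

```go
package main

import "fmt"

// batch is a stand-in for an engine write batch (hypothetical).
type batch struct {
	id   int
	sync bool
}

// commitReq pairs a batch with a channel carrying its commit result.
type commitReq struct {
	b    batch
	done chan error
}

// engineCommitter drains queued batches for a single engine, committing
// them in arrival order; one such goroutine runs per engine.
func engineCommitter(name string, queue <-chan commitReq) {
	for req := range queue {
		// Stand-in for the real engine commit (e.g. a RocksDB write batch).
		_ = name
		req.done <- nil
	}
}

// commitBoth queues a batch on each engine and then waits for both
// commits, letting the two engine commits proceed concurrently.
func commitBoth(base, raft chan<- commitReq, baseBatch, raftBatch batch) error {
	baseDone := make(chan error, 1)
	raftDone := make(chan error, 1)
	base <- commitReq{b: baseBatch, done: baseDone}
	raft <- commitReq{b: raftBatch, done: raftDone}
	if err := <-baseDone; err != nil {
		return err
	}
	return <-raftDone
}

func main() {
	baseQ := make(chan commitReq)
	raftQ := make(chan commitReq)
	go engineCommitter("base", baseQ)
	go engineCommitter("raft", raftQ)

	if err := commitBoth(baseQ, raftQ, batch{id: 1, sync: true}, batch{id: 2}); err != nil {
		fmt.Println("commit error:", err)
		return
	}
	fmt.Println("both commits done")
}
```

The restructuring mentioned above (queue both, then wait for both) is exactly what `commitBoth` does; the per-engine goroutines additionally preserve commit ordering within each engine.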
Rebasing on top of new master I'm seeing 0 QPS occurring quite frequently; this is ycsb -workload F on ultramarine:
9m17s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m18s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m19s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m20s 0.0 0 / 0 / 0 0 / 0 0 / 0
elapsed______ops/sec__reads/empty/errors___writes/errors____scans/errors
9m21s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m22s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m23s 116.0 0 / 0 / 0 116 / 0 0 / 0
9m24s 499.0 0 / 0 / 0 499 / 0 0 / 0
9m25s 420.0 0 / 0 / 0 420 / 0 0 / 0
9m26s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m27s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m28s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m29s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m30s 405.0 0 / 0 / 0 405 / 0 0 / 0
9m31s 59.0 0 / 0 / 0 59 / 0 0 / 0
9m32s 0.0 0 / 0 / 0 0 / 0 0 / 0
9m33s 0.0 0 / 0 / 0 0 / 0 0 / 0
...
elapsed__ops/sec(total)__errors(total)
773.4s 155.4 0
This is new post-rebase, I suspect it's due to #16942 and the reworked batching mechanisms. Isolating a build without those changes.
For posterity, these are the numbers without the raft log changes:
elapsed______ops/sec__reads/empty/errors___writes/errors____scans/errors
6m1s 558.0 0 / 0 / 0 558 / 0 0 / 0
6m2s 541.0 0 / 0 / 0 541 / 0 0 / 0
6m3s 596.0 0 / 0 / 0 596 / 0 0 / 0
6m4s 553.0 0 / 0 / 0 553 / 0 0 / 0
6m5s 629.0 0 / 0 / 0 629 / 0 0 / 0
6m6s 538.0 0 / 0 / 0 538 / 0 0 / 0
6m7s 566.0 0 / 0 / 0 566 / 0 0 / 0
6m8s 597.0 0 / 0 / 0 597 / 0 0 / 0
...
elapsed__ops/sec(total)__errors(total)
368.0s 457.6 0
No zero QPS, and almost three times as fast (!).
Nope, something else going on here. Here's a build reverting three related commits to the new batching/syncing mechanism:
a86bba2f6 Revert "storage/engine: use SyncWAL instead of sync on commit"
f9379de0e Revert "storage/engine: unify code to commit write-only and read-write batches"
ec8bd15c0 Revert "storage/engine: lazily initialize the RocksDB batch"
4m1s 0.0 0 / 0 / 0 0 / 0 0 / 0
4m2s 0.0 0 / 0 / 0 0 / 0 0 / 0
4m3s 244.0 0 / 0 / 0 244 / 0 0 / 0
4m4s 12.0 0 / 0 / 0 12 / 0 0 / 0
4m5s 0.0 0 / 0 / 0 0 / 0 0 / 0
4m6s 0.0 0 / 0 / 0 0 / 0 0 / 0
4m7s 0.0 0 / 0 / 0 0 / 0 0 / 0
4m8s 0.0 0 / 0 / 0 0 / 0 0 / 0
4m9s 0.0 0 / 0 / 0 0 / 0 0 / 0
4m10s 0.0 0 / 0 / 0 0 / 0 0 / 0
4m11s 280.0 0 / 0 / 0 280 / 0 0 / 0
...
elapsed__ops/sec(total)__errors(total)
251.3s 245.9 0
The zero QPS points coincide with long stretches of completely idle CPU time:
0 QPS like that often indicates a problem with leases.
Circling back to this issue: it's still visible after restructuring around truncations moving downstream and happening in its own dedicated queue after the fact ("truncations" only update TruncatedState; actual deletion of log entries happens after the fact). I'm sure there's a very simple oversight causing this but I'm still investigating. Below are some graphs from a longer run last night where we observe gradual degradation to the 0 QPS point. My initial theory is some funky interaction between the existing raft log queue (queueing truncations) and the snapshot queue (queueing snapshots; on initial investigation however the requests seem to originate out of raft directly). We see a continuous cycle of snapshot generation + application for the same small subset of ranges (this should have been mitigated by the proposal quota pool; I've come full circle). The workload here is ycsb -workload F -splits 1000 -concurrency 1000.
Forgive the verbosity/noise, I don't know what's relevant as yet.
Slow degradation in throughput followed by intermittent peaks:
Enqueued replicas despite 0 throughput, also a lot of failures:
Surge of non-preemptive snapshots generated (master
doesn't show this for the same workload):
Raft log self behind creeps up:
Replicas turn quiescent:
Other side effects, raft is blocked as well:
Seeing a lot of snapshot generation/application isolated to few specific ranges:
I170725 03:58:16.971243 1078385 storage/replica_raftstorage.go:769 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8570 (id=1d594482, encoded size=8836147, 34 rocksdb batches, 24 log entries)
I170725 03:58:17.112937 1078385 storage/replica_raftstorage.go:777 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 142ms [clear=0ms batch=99ms entries=0ms commit=42ms]
I170725 03:58:18.162192 1083635 storage/replica_raftstorage.go:769 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8619 (id=194c1442, encoded size=8895367, 34 rocksdb batches, 30 log entries)
I170725 03:58:18.315653 1083635 storage/replica_raftstorage.go:777 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 153ms [clear=5ms batch=57ms entries=1ms commit=90ms]
I170725 03:58:19.455216 1088924 storage/replica_raftstorage.go:769 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8665 (id=9afacbdb, encoded size=8941200, 34 rocksdb batches, 29 log entries)
I170725 03:58:19.535733 1088924 storage/replica_raftstorage.go:777 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 80ms [clear=0ms batch=36ms entries=0ms commit=44ms]
I170725 03:58:20.617621 1094108 storage/replica_raftstorage.go:769 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8703 (id=07eca121, encoded size=8972789, 35 rocksdb batches, 24 log entries)
I170725 03:58:20.861394 1094108 storage/replica_raftstorage.go:777 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 244ms [clear=0ms batch=174ms entries=1ms commit=69ms]
I170725 03:58:21.939922 1099239 storage/replica_raftstorage.go:769 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8748 (id=59467f09, encoded size=9017571, 35 rocksdb batches, 23 log entries)
I170725 03:58:22.004040 1099239 storage/replica_raftstorage.go:777 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 64ms [clear=0ms batch=8ms entries=0ms commit=55ms]
I170725 03:58:23.110255 1104397 storage/replica_raftstorage.go:769 [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8802 (id=f619d9ce, encoded size=9087894, 35 rocksdb batches, 33 log entries)
I'm not sure what it means yet, but I think that last graph is going to be the key: we have periodic bursts of messages getting through, coinciding with an increase in the level of dropped messages (and each time the spikes/increases get smaller). It looks like it repeats at regular intervals, but not at a round number (11 minutes?)
This suggests problems in the raft transport, maybe flow control related. I'd try increasing initialConnWindowSize in rpc/context.go.
I'm sure there's a very simple oversight causing this but still investigating
Found the issue: it was due to Replica.mu.raftLogSize not getting updated correctly despite actual truncations taking place. This happened after a live "migration" over to the new engine, when upstream of raft the "simulated truncation" (https://github.com/cockroachdb/cockroach/blob/7ca57b6cce19339e811fafc0421b90b978f098e0/pkg/storage/replica_command.go#L1927) did not play well with "actual" truncations happening downstream. The result was that despite "actual" truncations happening (log entries got deleted in RocksDB), the raft log queue still accrued replicas as individual Replica.mu.raftLogSize values grew indefinitely. All the following emergent side effects could be traced down to just that.
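A minimal sketch of the invariant that was being violated (hypothetical field and method names, not the actual fix): the deferred, downstream deletion must also decrement the in-memory size accounting the raft log queue consults, otherwise replicas keep getting requeued forever.

```go
package main

import (
	"fmt"
	"sync"
)

// replica is a toy stand-in; the real type is storage.Replica.
type replica struct {
	mu struct {
		sync.Mutex
		raftLogSize    int64  // bytes the raft log queue believes are live
		truncatedIndex uint64 // highest truncated raft log index
	}
}

// applyTruncation mirrors what the deferred deletion must do: besides
// deleting entries from RocksDB, update the in-memory accounting, or
// needsTruncation stays true indefinitely.
func (r *replica) applyTruncation(deletedBytes int64, newTruncIndex uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mu.raftLogSize -= deletedBytes
	if r.mu.raftLogSize < 0 {
		// Sizes are estimates; clamp rather than go negative.
		r.mu.raftLogSize = 0
	}
	r.mu.truncatedIndex = newTruncIndex
}

// needsTruncation is the kind of check the raft log queue performs
// (the threshold is picked arbitrarily for this sketch).
func (r *replica) needsTruncation() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.mu.raftLogSize > 64<<10
}

func main() {
	var r replica
	r.mu.raftLogSize = 128 << 10
	fmt.Println(r.needsTruncation()) // true: the log looks large
	r.applyTruncation(128<<10, 100)
	fmt.Println(r.needsTruncation()) // false once accounting is updated
}
```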
The graphs look steady following a 10hr run last night, albeit with reduced throughput.
Here are the new degradations (the implementation is as yet incomplete, in an awkward state from debugging the previous issue); things to look out for:
Still, we have more replicas queued for log truncations (I have a handle on why, and am addressing it):
Despite this we seem to be processing fewer replica log truncations:
Side effect perhaps, raft throughput takes a hit:
Most pressing, raft log commit latencies shoot way up. To a lesser degree the same is true for raft command commit latencies. For this run log truncations, though processed downstream, happen just as frequently. The Raft RocksDB instance is untuned (same defaults as the base engine). Investigating.
Another stab at this, w/ dedicated raft log gc queue "actually" deleting after the fact. Live migration takes place around 16:52, we maintain steady performance for a while before eventual 0 QPS again. Similar symptoms as before, and at first glance similar causes (over-aggressive raft log truncations).
Dropping throughput, eventually zero, starting at 16:52:
Uptick in replicas queued in the raft log queue, again, starting at 16:52:
Again we eventually get to cyclic snapshot generation:
Right around then more replicas report themselves to be "behind" (hence the snapshots):
Replicas eventually quiesce, we're now at 0 QPS:
Raft log commit latencies shoot up, raft command commits too but not so much. Consider that the new engine is as yet untuned:
As an aside, some new metrics we're tracking now. Will aid when tuning:
Heh, I've used the term "snapshot death spiral" for this behavior. Similar problems occurred in the past when Raft log truncation was truncating while we were trying to catch up or rebalance a replica. The heuristics now in raft_log_queue.go are supposed to avoid this, but perhaps something in this PR has broken those heuristics.
Fixing the zero QPS issue (more raft log truncation simulation badness, sigh), I'm now running into OOM/high RocksDB memory usage behavior.
The first two steep runs are running truncations in the separate raft log gc queue using tombstones, fairly infrequently at that. The disk usage for the raft engine just grows out of bounds indefinitely until we hit 47 GiB, when we are killed. The long stable run is our disk usage without the raft engine enabled, where we seem to stay steady under the same constant load (ycsb -workload F -splits 500 -concurrency 500).
The differences between the three runs where we max out disk space:
The blip towards the end of the first run is what happens when the first node is killed after running out of memory: queries stall temporarily, presumably allowing the other nodes to run compactions and lower disk usage.
Here are some compaction stats right before this happens on n2,s2:
I170802 14:47:44.643162 92 storage/store.go:4386 [n2,s2] sstables (raft eng read amplification = 3):
0 [ 38M 1 ]: 38M
5 [ 4M 1 ]: 4M
6 [ 79M 2 ]: 77M 1M
I170802 14:47:44.643226 92 storage/store.go:4387 [n2,s2]
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 1/0 37.67 MB 0.9 0.0 0.0 0.0 3.7 3.7 0.0 1.0 0.0 49.6 76 99 0.770 0 0
L5 1/0 3.60 MB 0.5 3.5 3.5 0.0 2.8 2.8 0.0 0.8 46.4 37.2 78 47 1.654 6896K 686K
L6 2/0 78.83 MB 0.0 6.5 3.0 3.5 3.6 0.1 0.0 1.2 62.5 34.6 106 63 1.678 9644K 6277K
Sum 4/0 120.09 MB 0.0 10.0 6.5 3.5 10.1 6.6 0.0 2.7 39.3 39.8 260 209 1.243 16M 6963K
Int 0/0 0.00 KB 0.0 1.3 0.8 0.5 1.3 0.9 0.0 2.7 28.4 29.4 46 25 1.837 2117K 849K
Uptime(secs): 4810.2 total, 4810.2 interval
Flush(GB): cumulative 3.693, interval 0.491
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 10.09 GB write, 2.15 MB/s write, 9.98 GB read, 2.12 MB/s read, 259.7 seconds
Interval compaction: 1.32 GB write, 0.28 MB/s write, 1.27 GB read, 0.27 MB/s read, 45.9 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Here are some after:
I170802 15:07:44.645798 92 storage/store.go:4386 [n2,s2] sstables (raft eng read amplification = 1):
6 [ 7M 1 ]: 7M
I170802 15:07:44.645980 92 storage/store.go:4387 [n2,s2]
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 3.8 3.8 0.0 1.0 0.0 49.5 78 102 0.766 0 0
L5 0/0 0.00 KB 0.0 3.6 3.6 0.0 2.9 2.9 0.0 0.8 47.0 37.3 80 49 1.623 7267K 743K
L6 1/0 7.23 MB 0.0 6.7 3.0 3.6 3.7 0.0 0.0 1.2 61.2 33.5 112 65 1.718 10M 6628K
Sum 1/0 7.23 MB 0.0 10.3 6.7 3.7 10.3 6.7 0.0 2.7 39.2 39.3 269 216 1.247 17M 7371K
Int 0/0 0.00 KB 0.0 0.1 0.1 0.1 0.0 -0.1 0.0 3.2 171.1 28.6 1 3 0.264 436K 263K
Uptime(secs): 6010.2 total, 6010.2 interval
Flush(GB): cumulative 3.776, interval 0.007
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 10.33 GB write, 1.76 MB/s write, 10.32 GB read, 1.76 MB/s read, 269.3 seconds
Interval compaction: 0.02 GB write, 0.00 MB/s write, 0.13 GB read, 0.02 MB/s read, 0.8 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Disk size usage before (raft data is only stored in the /raft subdir; the delta is the disk space used by the single base engine):
~ crl-ssh ultramarine all sudo du -sh /mnt/data1/raft
1: cockroach@cockroach-ultramarine-0001.crdb.io
11G /mnt/data1/raft
2: cockroach@cockroach-ultramarine-0002.crdb.io
11G /mnt/data1/raft
3: cockroach@cockroach-ultramarine-0003.crdb.io
12G /mnt/data1/raft
4: cockroach@cockroach-ultramarine-0004.crdb.io
10G /mnt/data1/raft
~ crl-ssh ultramarine all sudo du -sh /mnt/data1
1: cockroach@cockroach-ultramarine-0001.crdb.io
16G /mnt/data1
2: cockroach@cockroach-ultramarine-0002.crdb.io
15G /mnt/data1
3: cockroach@cockroach-ultramarine-0003.crdb.io
16G /mnt/data1
4: cockroach@cockroach-ultramarine-0004.crdb.io
15G /mnt/data1
Disk usage after; n1 was killed and restarted. Note the disk usage for the other nodes dropped in the interim.
➜ server git:(dedicated-raft-storage) crl-ssh ultramarine all sudo du -sh /mnt/data1/raft
1: cockroach@cockroach-ultramarine-0001.crdb.io
80M /mnt/data1/raft
2: cockroach@cockroach-ultramarine-0002.crdb.io
11G /mnt/data1/raft
3: cockroach@cockroach-ultramarine-0003.crdb.io
12G /mnt/data1/raft
4: cockroach@cockroach-ultramarine-0004.crdb.io
11G /mnt/data1/raft
➜ server git:(dedicated-raft-storage) crl-ssh ultramarine all sudo du -sh /mnt/data1
1: cockroach@cockroach-ultramarine-0001.crdb.io
4.8G /mnt/data1
2: cockroach@cockroach-ultramarine-0002.crdb.io
17G /mnt/data1
3: cockroach@cockroach-ultramarine-0003.crdb.io
18G /mnt/data1
4: cockroach@cockroach-ultramarine-0004.crdb.io
17G /mnt/data1
Before looking into this further: is there anything in place currently to bound disk usage for the base engine? I'm surprised it grows so rapidly having simply moved raft log entries + truncations to a separate engine and reduced the frequency with which they occur. The raft engines for all three runs are using the same config/tunings as the base engine.
As mentioned offline, there is no limiter on the amount of disk space RocksDB uses. The significant disk growth is very surprising. I wonder if RocksDB isn't compacting sstables in the Raft log engine for some reason. Nothing is jumping out at me in the compaction stats. Do you have compaction stats from the Raft log engine when its disk usage grew to 47GB?
The snippets above are the stats from the raft log engine. Investigating.
Huh? The total data size in sstables there is tiny. I guess that makes sense on the one hand, we don't expect the Raft log engine to take much space on disk. So where is the 47GB number coming from?
Strange, the compaction stats don't match up with the files on disk.
I170802 17:39:07.336853 10 storage/store.go:4386 [n1,s1] sstables (raft eng read amplification = 1):
6 [ 2M 1 ]: 2M
I170802 17:39:07.336915 10 storage/store.go:4387 [n1,s1]
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 23.7 1 15 0.087 0 0
L6 1/0 1.82 MB 0.0 0.1 0.0 0.0 0.0 -0.0 0.0 0.9 81.5 37.7 1 14 0.047 613K 588K
Sum 1/0 1.82 MB 0.0 0.1 0.0 0.0 0.1 0.0 0.0 1.8 27.5 28.4 2 29 0.068 613K 588K
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
Where I'm supposed to have a single 2MB L6 file, instead I have:
cockroach@cockroach-ultramarine-0001:/mnt/data1/raft$ du -sh *
2.0M 000007.sst
2.4M 000011.sst
2.0M 000014.sst
1.8M 000017.sst
1.7M 000020.sst
1.2M 000023.sst
2.0M 000026.sst
1.5M 000029.sst
2.1M 000032.sst
1.5M 000035.sst
2.1M 000038.sst
2.1M 000041.sst
2.2M 000044.sst
1.4M 000047.sst
55M 000048.log
1.9M 000050.sst
4.0K auxiliary
4.0K COCKROACHDB_VERSION
4.0K CURRENT
4.0K IDENTITY
0 LOCK
4.0K MANIFEST-000008
8.0K OPTIONS-000005
That's very strange. I'd attempt to reproduce this on a local single-node cluster and try to get to the bottom of it.
The issue above was a resource leak: keeping snapshots open leaves SSTs on disk, but they don't show up in the compaction stats. TIL.
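To illustrate the footgun with a toy model (hypothetical types, not the engine's actual API): snapshots pin obsolete SST files until released, so every acquisition needs a paired Close, typically deferred.

```go
package main

import "fmt"

// engine tracks how many snapshots are pinning obsolete files.
type engine struct {
	openSnapshots int
}

// snapshot is a toy stand-in for an engine snapshot that pins SSTs.
type snapshot struct {
	eng *engine
}

func (e *engine) NewSnapshot() *snapshot {
	e.openSnapshots++
	return &snapshot{eng: e}
}

func (s *snapshot) Close() { s.eng.openSnapshots-- }

// read takes a snapshot for a consistent read and releases it; without
// the deferred Close, obsolete SSTs linger on disk while the compaction
// stats report almost nothing.
func read(e *engine) {
	snap := e.NewSnapshot()
	defer snap.Close()
	// ... consistent reads against snap would go here ...
}

func main() {
	e := &engine{}
	for i := 0; i < 100; i++ {
		read(e)
	}
	fmt.Println("open snapshots:", e.openSnapshots)
}
```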
I put together another change here, obvious in hindsight, where we can get away with flushing the base engine's WAL once for all replicas (as opposed to flushing each time we delete each replica's raft log entries) and deleting entries up to that compaction point in a dedicated scanner + queue. The way you'd do this is with a first-pass "mark" phase, marking the low watermark for log entry deletions for all replicas, followed by a single base engine flush. After this we simply delete up to the marked point. See #16361 for why this is even necessary. Right now this deletion happens sequentially, though it need not: it can be batched across all ranges to avoid multiple (fast, we use range deletion tombstones) CGo calls, but even without doing that there are measurable perf gains.
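The mark/flush/delete scheme can be sketched as follows (hypothetical function signatures under stated assumptions; the real implementation lives in the storage package):

```go
package main

import "fmt"

// truncMark records, per range, the raft log index up to which entries
// may be deleted.
type truncMark struct {
	rangeID uint64
	index   uint64
}

// runRaftLogGC does a "mark" pass over all replicas, issues a single
// base-engine flush covering every mark, and only then deletes the
// entries, instead of flushing once per replica.
func runRaftLogGC(
	truncatedIndex func(rangeID uint64) uint64,
	flushBaseEngine func() error,
	deleteEntries func(rangeID, upTo uint64) error,
	rangeIDs []uint64,
) error {
	// Phase 1: mark the deletion low watermark for every replica.
	marks := make([]truncMark, 0, len(rangeIDs))
	for _, id := range rangeIDs {
		marks = append(marks, truncMark{rangeID: id, index: truncatedIndex(id)})
	}
	// One flush of the base engine is enough for all marked deletions.
	if err := flushBaseEngine(); err != nil {
		return err
	}
	// Phase 2: delete entries up to each mark. This loop could itself be
	// batched into a single write to save CGo calls.
	for _, m := range marks {
		if err := deleteEntries(m.rangeID, m.index); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	flushes, deletions := 0, 0
	err := runRaftLogGC(
		func(rangeID uint64) uint64 { return rangeID * 10 }, // fake watermarks
		func() error { flushes++; return nil },
		func(rangeID, upTo uint64) error { deletions++; return nil },
		[]uint64{1, 2, 3},
	)
	fmt.Println(err, flushes, deletions) // one flush covers three deletions
}
```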
Before getting into some of the results: the original motivation to decouple the two engines was the unnecessary syncing of user-level keys when syncing raft log data (described in more detail in #16361). What this change intended to do was reduce RocksDB commit latencies, since synced writes would additionally flush previously unsynced writes. If we maintain two engines, however, we can ensure synced writes only flush what absolutely needs to be flushed.
As was observed in #16942 (and repeated below), at the time of writing our 99th percentile raft command commit latencies were ~30-40ms. We observed that most of the time unsynced writes were waiting around for an earlier WAL sync. By introducing a dedicated RocksDB SyncWAL goroutine and slightly changing the semantics of synced writes (add batch to WAL, add entries to memtable, sync WAL, instead of add batch to WAL, sync WAL, and finally add entries to memtable), we were able to avoid having the unsynced writes wait around for synced writes. This brought down the raft command commit latencies from ~30-40ms to ~6-10ms.
Putting aside the raft command commit latency reductions via the implementation in #16942: given what we know of unsynced commands waiting for earlier synced commands flushing the WAL, if we use two separate engines with the base engine never flushing (except for the batched log truncations, which we can ignore given how infrequent they are), we should similarly see a reduction in raft command commit latencies and an increase in throughput. This is in fact what we observe.
Two runs, 1hr long, on ultramarine running ycsb -workload F -splits 500 -concurrency 500.
Before (without the SyncWAL changes and before introducing the dedicated raft engine):
After (without the SyncWAL changes and after introducing the dedicated raft engine):
I ran the same with the SyncWAL changes, verifying the 15% boost alluded to in #16942. Here's a table with all the combinations (undedicated is using the single base engine for all keys, including raft data).
            | pre-syncWAL  | syncWAL
------------+--------------+------------------------
undedicated | 2063.8 ops/s | 2358.0 ops/s (+14.26%)
dedicated   | 2215.1 ops/s | 2383.5 ops/s (+7.60%)
            |   (+7.33%)   |   (+1.08%)
As for the difference between the +7.33% going from (pre-syncWAL, undedicated) to (pre-syncWAL, dedicated) and the +14.26% going from (pre-syncWAL, undedicated) to (syncWAL, undedicated), I'm not completely sure. We seem to get an increase in throughput for similar reasons, but not quite to the same extent. The two runs can be seen here (+7.33%) & here (+14.26%).
The only thing that stood out to me is that the 99th percentile raft log commit time jumps up a bit (node 1 is the node ycsb is running on).
Before (without the SyncWAL changes and before introducing the dedicated raft engine):
After (without the SyncWAL changes and after introducing the dedicated raft engine):
My guess is that this is because the new engine, now that batches addressed to it are batched together, is collecting more "synchronous" batches. Previously, raft log commits got batched with raft command commits, so the latency "evened out". Now that batches contain only raft data addressed to the same engine, we may be accreting more "synchronous" batches per commit, contributing to slightly higher 99th percentile raft log commit latencies.
Now, after these changes there are certain existing tunings that are suboptimal given the new restructuring of raft log truncations. Consider that "truncation requests" going through the system don't actually truncate the log, but simply update `TruncatedState`. The actual deletions are deferred, happening after the fact in the dedicated scanner + queue. One simple tuning I tried was lowering the frequency of raft log truncations by increasing `RaftLogQueueTimerDuration` and `RaftLogQueueStaleThreshold`:
```diff
diff --git i/pkg/storage/raft_log_queue.go w/pkg/storage/raft_log_queue.go
index b43a11841..6adb2c695 100644
--- i/pkg/storage/raft_log_queue.go
+++ w/pkg/storage/raft_log_queue.go
@@ -43,11 +43,11 @@ const (
 	// RaftLogQueueTimerDuration is the duration between truncations. This needs
 	// to be relatively short so that truncations can keep up with raft log entry
 	// creation.
-	RaftLogQueueTimerDuration = 50 * time.Millisecond
+	RaftLogQueueTimerDuration = 100 * time.Millisecond
 	// RaftLogQueueStaleThreshold is the minimum threshold for stale raft log
 	// entries. A stale entry is one which all replicas of the range have
 	// progressed past and thus is no longer needed and can be truncated.
-	RaftLogQueueStaleThreshold = 100
+	RaftLogQueueStaleThreshold = 150
 	// RaftLogQueueStaleSize is the minimum size of the Raft log that we'll
 	// truncate even if there are fewer than RaftLogQueueStaleThreshold entries
 	// to truncate. The value of 64 KB was chosen experimentally by looking at
```
Another change I made was to the scheduler concurrency; this was motivated by trying to induce more batching at each of the two RocksDB commits (raft log and raft command).
```diff
diff --git i/pkg/storage/store.go w/pkg/storage/store.go
index a458d338b..4ff6bae28 100644
--- i/pkg/storage/store.go
+++ w/pkg/storage/store.go
@@ -123,7 +123,7 @@ var defaultRaftElectionTimeoutTicks = envutil.EnvOrDefaultInt(
 	"COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS", 15)
 var storeSchedulerConcurrency = envutil.EnvOrDefaultInt(
-	"COCKROACH_SCHEDULER_CONCURRENCY", 12*runtime.NumCPU())
+	"COCKROACH_SCHEDULER_CONCURRENCY", 8*runtime.NumCPU())
 var enablePreVote = envutil.EnvOrDefaultBool(
```
I ran another set of four runs, each 1h long with the same permutations as above, and measured the perf gains.
```
              pre-syncWAL              syncWAL
undedicated   2252.6 ops/s             2390.7 ops/s (+6.13% vs pre-syncWAL)
dedicated     2345.1 ops/s             2472.2 ops/s (+5.42% vs pre-syncWAL)
              (+4.11% vs undedicated)  (+3.40% vs undedicated)
```
The difference between going from (pre-syncWAL,undedicated) to (syncWAL,undedicated) and (pre-syncWAL,undedicated) to (pre-syncWAL,dedicated) is less pronounced.
Overall the tunings help (+cc #17165).
```
          undedicated,pre   undedicated,post   dedicated,pre   dedicated,post
untuned   2063.8 ops/s      2358.0 ops/s       2215.1 ops/s    2383.5 ops/s
tuned     2252.6 ops/s      2390.7 ops/s       2345.1 ops/s    2472.2 ops/s
          (+9.15%)          (+2.76%)           (+5.87%)        (+3.72%)
```
The final column shows the performance gain from my first stab at tuning this in conjunction with the new engine: +3.72%.
TL;DR:
The perf gain this would have had without #16942 going in: +13.63% [Going from (pre-syncWAL,undedicated,untuned) to (pre-syncWAL,dedicated,tuned)]
The perf gain to be had now: +4.84% [Going from (syncWAL,undedicated,untuned) to (syncWAL,dedicated,tuned)]
@irfansharif This write-up is awesome reading. Lots of good work here even though this change isn't landing in 1.1.
Closing this as we're unlikely to revisit this approach. Storage of the Raft log remains a performance concern, but my current thinking is to either use something like RocksDB 2PC or to add a new mechanism to RocksDB for storage of write-once read-almost-never values.
It seems like we're actively doing this, right?
A design + interface was merged in https://github.com/cockroachdb/cockroach/pull/72795
The design mentioned in the previous comment has a WAL in both storage engines. The WAL for the state machine engine is very rarely synced. This is similar to the setup described in the second experiment from https://github.com/cockroachdb/cockroach/issues/38322#issuecomment-504856253 which demonstrated a doubling in throughput, with the following hypothesis "which is presumably in part because we no longer have Raft entry application getting mixed in the same WAL as the Raft log writes" -- i.e., since each fdatasync is syncing half the data volume, it is faster.
Due to the recent regression noticed when loosely coupled raft log truncation was enabled (see https://github.com/cockroachdb/cockroach/issues/78412#issuecomment-1119922463 and earlier comments by @erikgrinaker), we probably need a more thorough understanding of the write amplification effect of the various options.
The experiments here are done using a benchmark whose workload consists of a set of ranges; for each range it does a single `MVCCBlindPut`, adds that batch to the raft log (with sync), then modifies the batch to also update the `RangeAppliedState` and commits it to the state machine (without sync). It cycles through all the ranges using a single thread -- this was done to ensure that the LSM(s) did not fall behind in compactions. Whenever a range has > 64KB in the raft log, a truncation is done, either in a tightly-coupled manner, i.e., the truncation is done immediately (without sync), or in a loosely-coupled manner where the next flushed callback from the state machine engine triggers the truncation. Code is in https://github.com/sumeerbhola/cockroach/tree/sep_lsm.
Since there is no concurrency in the workload, there is no batching between calls to sync the WAL, so the throughput shouldn't be considered representative, but it serves as a way to compare alternatives. Also, most of the results here are about the write amplification (we discuss throughput at the end). The keys were 60 bytes and the values were 1000 bytes, and are completely random, so will not compress, which means the bytes in the WAL and in the sstables are quite comparable.
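For intuition, a toy version of the benchmark loop might look like the following (a pure-Go sketch of the tightly-coupled variant; the real benchmark in the sep_lsm branch writes to actual Pebble engines, and all names here are illustrative):

```go
package main

import "fmt"

const (
	valueSize     = 1000     // ~1000-byte values, per the experiment description
	truncateAfter = 64 << 10 // truncate a range's raft log once it exceeds 64KB
)

// rng is a toy stand-in for one range's raft log state.
type rng struct {
	logBytes  int
	truncated int
}

// step models one iteration for one range: append an entry to the raft log
// (synced in the real benchmark), apply it to the state machine (unsynced),
// and, in the tightly-coupled variant, truncate once past the threshold.
func (r *rng) step() {
	r.logBytes += valueSize // raft log append (+ WAL sync in the real benchmark)
	// ... state machine apply, without sync, would go here ...
	if r.logBytes > truncateAfter {
		r.logBytes = 0 // tightly-coupled truncation, without sync
		r.truncated++
	}
}

func main() {
	// Single-threaded round-robin over all ranges, as in the experiment.
	ranges := make([]rng, 4)
	for i := 0; i < 1000; i++ {
		ranges[i%len(ranges)].step()
	}
	fmt.Println("truncations per range:", ranges[0].truncated)
}
```

In the loosely-coupled variant the `truncated++` step would instead be deferred until a (simulated) flush callback from the state machine engine.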
The benchmark is broadly used to compare 1eng-tight, 1eng-loose, 2eng-tight, 2eng-loose, with a few other knobs. Note that 2eng-tight and 1eng-loose can be considered odd combinations.
These experiments were run on a gceworker with 100GB PD-ssd. The GCP documentation says these get 0.48MB/s/GB, so 48MB/s of throughput. Whether the provisioned bandwidth was saturated or not has no impact on the write amp numbers, but does have a bearing on the later throughput numbers. Measurements indicated we were using < 30MB/s, so not bandwidth saturated. All experiments set the maximum compaction concurrency to 8, and for 2eng this was split into 4 each for the 2 engines. Less than 1 CPU of the gceworker was being utilized by the benchmark, so there is no CPU saturation.
This is a favorable case for tightly-coupled truncation since each range will truncate after 64KB, which allows most truncations to happen while the original raft log entry is still in the memtable. We consider this case since even though a store has 1000s of ranges, it is possible that the write load is dominated by a handful of ranges, and the system should not regress in performance for such a case.
Summary: with the "large data volume" 2eng-loose was comparable to 1eng-tight, with fair memtable sizing across the 2 setups. This suggests that in the 1eng case there is more than expected write amplification caused by the modest amount of raft bytes that do get flushed to L0. This may need more investigation.
For the 2eng case, each WAL saw ~550MB of writes. For 2eng-tight, only 4.8MB was written to L0 for the raft-eng. So log truncation ensures that using an LSM to store the raft logs is very efficient -- all the write cost is in the WAL of the raft-eng. The following shows the summary numbers for each setup. The compaction number includes flush bytes. Write-amp is calculated as compaction bytes / WAL bytes. The summary numbers are produced after a manual compaction compacts everything to L6.
1eng-loose: Write Bytes WAL: 1.1 GiB, Compaction: 5.9 GiB, W-Amp(C/W): 5.45
1eng-tight: Write Bytes WAL: 1.1 GiB, Compaction: 4.4 GiB, W-Amp(C/W): 4.06
2eng-loose: Write Bytes WAL: 1.1 GiB, Compaction: 4.3 GiB, W-Amp(C/W): 3.70
2eng-tight: Write Bytes WAL: 1.1 GiB, Compaction: 3.3 GiB, W-Amp(C/W): 2.82
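As a sanity check, the W-Amp(C/W) column can be recomputed with a trivial helper (hypothetical; note the table's figures are derived from unrounded byte counts, so recomputing from the rounded GiB values gives e.g. 4.00 rather than the reported 4.06):

```go
package main

import "fmt"

// wAmp computes write amplification as compaction bytes (which include flush
// bytes) divided by WAL bytes, matching the W-Amp(C/W) column above.
func wAmp(compactionGiB, walGiB float64) float64 {
	return compactionGiB / walGiB
}

func main() {
	// 1eng-tight in the small-data-volume run: 4.4 GiB compacted, 1.1 GiB WAL.
	fmt.Printf("%.2f\n", wAmp(4.4, 1.1))
}
```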
We expect 1eng-tight to have lower write-amp than 1eng-loose and 2eng-tight to have lower write-amp than 2eng-loose, and the numbers are consistent with that expectation. This is because the loose ones need to wait for the raft log entries to flush before the corresponding state machine changes are durable, and then the truncation is applied.
The surprise was 2eng-loose having lower write-amp than 1eng-tight. Also, we would expect 1eng-tight to be the same as 2eng-tight given that only a tiny amount of the raft log made its way out of the memtable in the tight case. The suspicion was that the better behavior of 2eng is due to the unfair advantage that 2eng has in terms of having 2 64MB memtables, so they can hold more in-memory before flushing, which means fewer flushes and less compaction out of L0 etc.
The larger data volume ensures there is ~10GB in the state machine at the end of the experiment, so levels L3-L6 are populated. The summary numbers:
1eng-loose: Write Bytes WAL: 21 GiB, Compaction: 211 GiB, W-Amp(C/W): 10.01
1eng-tight: Write Bytes WAL: 21 GiB, Compaction: 190 GiB, W-Amp(C/W): 9.00
2eng-loose: Write Bytes WAL: 21 GiB, Compaction: 157 GiB, W-Amp(C/W): 7.41
2eng-tight: Write Bytes WAL: 21 GiB, Compaction: 141 GiB, W-Amp(C/W): 6.68
Again we see both 2eng setups being better than 1eng-tight.
1eng-loose: Write Bytes WAL: 2.2 GiB, Compaction: 11 GiB, W-Amp(C/W): 5.14
1eng-tight: Write Bytes WAL: 2.2 GiB, Compaction: 7.9 GiB, W-Amp(C/W): 3.63
2eng-loose: Write Bytes WAL: 2.2 GiB, Compaction: 9.6 GiB, W-Amp(C/W): 4.40
2eng-tight: Write Bytes WAL: 2.2 GiB, Compaction: 7.9 GiB, W-Amp(C/W): 3.60
Once we remove the unfair advantage of the memtable size, we see that 1eng-tight and 2eng-tight are comparable. Interestingly, 1eng-loose is worse than 2eng-loose, even though the number of truncations and the indexes that were truncated matched up. We noticed more bytes were written to L0 for 1eng-loose than 2eng-loose. We hypothesize that this difference is because there is a slight difference in when the sm-eng (state machine engine) memtable flushes in the 2eng case, which allows some raft log entries to be truncated before the raft-eng memtable flushes.
The remaining experiments all use this memtable size configuration.
This was done to see if the comparisons from the previous experiment held up with a larger data volume. NB: the following results are repeatable (the largest delta of write-amp in a different run was 0.03).
1eng-loose: Write Bytes WAL: 21 GiB, Compaction: 187 GiB, W-Amp(C/W): 8.84
1eng-tight: Write Bytes WAL: 21 GiB, Compaction: 157 GiB, W-Amp(C/W): 7.41
2eng-loose: Write Bytes WAL: 21 GiB, Compaction: 156 GiB, W-Amp(C/W): 7.41
2eng-tight: Write Bytes WAL: 21 GiB, Compaction: 142 GiB, W-Amp(C/W): 6.70
Interestingly, despite no unfairness in memtable sizing, 2eng-tight is better than 1eng-tight and 2eng-loose is comparable to 1eng-tight. Looking at 2eng-tight, we see the raft-eng has only 96MB being written to L0 (the rest is truncated before the flush):
```
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 44 M - 11 G - - - - 11 G - - - 1.0
0 0 0 B 0.00 11 G 0 B 0 0 B 0 96 M 198 0 B 0 0.0
```
So ~1% of what was written to the memtable is not truncated at flush time, because truncation is not perfectly aligned with the filling up of a memtable. Looking at the details of the LSM shape, we hypothesize that a similar 96MB being flushed to L0 in the 1eng case could have an outsize influence on later compactions from L0 to L3 and so on down to L6, since the 2eng case has another advantage: in the 1eng case, the L0 file-count target and the size targets for the lower levels are shared between raft and state machine data.
This is designed to nullify the advantage of tightly-coupled truncation since 64KB truncation threshold * 5000 = ~320MB, which is much larger than the ~64MB of memtable capacity being used by raft entries. We expect loosely-coupled truncation to do well here. The main reason to run this is to understand the write amplification cost of storing the raft logs in an LSM, when most of the raft log bytes do get flushed to L0.
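A quick back-of-the-envelope check of why tightly-coupled truncation loses here (constants taken from the experiment description; using binary units the product comes out near ~312MB rather than the round ~320MB):

```go
package main

import "fmt"

const (
	numRanges     = 5000     // active ranges in this experiment
	truncThresh   = 64 << 10 // per-range truncation threshold (64KB)
	memtableBytes = 64 << 20 // approximate memtable capacity (~64MB)
)

// pendingLogBytes is the worst-case volume of raft log entries resident
// before any range hits its truncation threshold.
func pendingLogBytes() int { return numRanges * truncThresh }

func main() {
	// Pending log bytes dwarf the memtable, so most raft entries are flushed
	// to L0 before a truncation can delete them in place.
	fmt.Printf("pending=%dMB memtable=%dMB\n", pendingLogBytes()>>20, memtableBytes>>20)
}
```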
1eng-loose: Write Bytes WAL: 21 GiB, Compaction: 217 GiB, W-Amp(C/W): 10.27
1eng-tight: Write Bytes WAL: 21 GiB, Compaction: 205 GiB, W-Amp(C/W): 9.67
2eng-loose: Write Bytes WAL: 21 GiB, Compaction: 197 GiB, W-Amp(C/W): 9.32
2eng-tight: Write Bytes WAL: 21 GiB, Compaction: 190 GiB, W-Amp(C/W): 8.98
Another experiment that used SingleDel to truncate the log in all these cases showed no improvement.
Ideas for improvement (if this is a realistic scenario):
- `L0CompactionFileThreshold` defaults to 500, so a single flush with 5000 sstables would exceed that threshold.
- Raise `L0CompactionThreshold`, which we set to 2 in CockroachDB. This may be ok since `raftentry.Cache` reduces the need to read back raft log entries.

One advantage of separating the engines is that the write cost of the raft log becomes very obvious, and it becomes more realistic to work on optimizing it.
The observed throughput in the earlier experiments was the same for all four alternatives. To investigate, we looked at `pebble.InternalIntervalMetrics`, which provide information about the WAL writer throughput and the WAL sync latency distribution. This confirmed that the observed throughput is constrained by the WAL throughput of the engine containing the raft log. Observations:
@tbg @nvanbenschoten @erikgrinaker @cockroachdb/storage
Thanks for running these experiments @sumeerbhola! I think this just confirms that using an LSM for the Raft log is suboptimal: a write-amp of up to 10 seems pretty bad, and this work is mostly wasted since we're ideally only reading this data once.
Can we do any further optimizations to the Pebble engine used for the log to get this number down? E.g. reduce the number of levels, reduce the compaction aggressiveness, increase the memtable size, and so on? It would be good to get this down into the 2-4 range in the typical case under high write volume.
Longer-term, I think we should consider using a different storage engine -- ideally, we should be able to get this down to two write operations: commit and truncation. Any thoughts on this?
Playing around with the L0CompactionThreshold helps. The first LSM is using the default of 2 and the second one is using 10. With 10, the read amp stays <= 6, so I think this is a viable setting for the raft log.
Note that since the log is not going to be truncated soon, and since we want quick lookup by index, one can't avoid flushing. So the 11GB in the WAL + 10GB of writes to L0 probably can't be avoided (without a different indexing structure that directly indexes the WAL). On top of that we are paying 23GB. Increasing L0CompactionThreshold further had diminishing returns, but I did not investigate in a lot of detail. Regarding using a different storage engine, I'm not very inclined towards it -- these things look simple initially but become a huge effort over time. If we start running with separated engines, with the raft engine using a higher L0CompactionThreshold, and then see that real production cases resemble this 5000-active-ranges setup, I think we can come up with techniques to optimize Pebble for this use case (e.g. value separation like WiscKey, discussed in https://github.com/cockroachdb/pebble/issues/112, since the keys are tiny in this case).
L0CompactionThreshold=2
```
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 5.5 M - 11 G - - - - 11 G - - - 1.0
0 0 0 B 0.00 11 G 0 B 0 0 B 0 10 G 5.3 K 0 B 0 1.0
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
3 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
4 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
5 21 64 M 0.99 10 G 0 B 0 78 M 40 19 G 5.7 K 21 G 1 1.9
6 32 184 M - 8.6 G 0 B 0 34 K 12 26 G 4.3 K 35 G 1 3.1
total 53 247 M - 11 G 0 B 0 78 M 52 67 G 15 K 56 G 2 6.2
flush 199
compact 4527 0 B 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
```
L0CompactionThreshold=10
```
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 5.5 M - 11 G - - - - 11 G - - - 1.0
0 58 106 M 0.60 11 G 0 B 0 0 B 0 10 G 6.0 K 0 B 3 1.0
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
3 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
4 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
5 12 46 M 0.93 10 G 0 B 0 0 B 0 8.0 G 2.1 K 12 G 1 0.8
6 17 107 M - 5.9 G 0 B 0 194 M 49 15 G 2.7 K 21 G 1 2.6
total 87 259 M - 11 G 0 B 0 194 M 49 44 G 11 K 33 G 5 4.1
flush 199
compact 1740 303 M 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
```
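For reference, a sketch of how a dedicated raft engine might be opened with this setting (the `Options.L0CompactionThreshold` field and the `pebble.Open` signature are from cockroachdb/pebble; the function name, path handling, and the choice of 10 here simply mirror the experiment, not a shipped configuration):

```go
package storage

import "github.com/cockroachdb/pebble"

// openRaftEngine opens a dedicated raft-log engine with a raised
// L0CompactionThreshold, mirroring the experiment above. The name and the
// single-option configuration are illustrative.
func openRaftEngine(dir string) (*pebble.DB, error) {
	opts := &pebble.Options{
		// CockroachDB's default is 2; ~10 kept read-amp <= 6 for the raft
		// log in the experiment while substantially reducing compactions.
		L0CompactionThreshold: 10,
	}
	return pebble.Open(dir, opts)
}
```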
I think we can come up with techniques to optimize Pebble for this use case
Yeah, that's our only option in the medium-term anyway, so we may as well see how far we can take it.
Thanks for all of the explanations. I was able to follow your write-up (and, with some looking at the experiment code, feel like I know what's going on); it makes sense.
The IO overload experiments I've been looking into have dwelled on w-amp a lot, too. Most participants (maybe not you, but definitely myself and @nicktrav) did not expect such large multiples (we're seeing 15x most of the time on larger clusters). I agree that to do anything here, we need to separate the raft log out first, so I hope we can get back to that soon.
Playing around with the L0CompactionThreshold helps. The first LSM is using the default of 2 and the second one is using 10. With 10, the read amp stays <= 6, so I think this is a viable setting for the raft log.
Reading this:
// The amount of L0 read-amplification necessary to trigger an L0 compaction.
and since in these experiments, I can't imagine there would be files that can share a sublevel (every SST overlaps every other SST), isn't this just the number of SSTs required until we compact out of L0? So 2 will trigger (or at least suggest) a compaction when the second SST enters L0. Shouldn't you necessarily see a read-amp of >10 if that setting is 10, then? Does it have to do with compactions creating sublevels that drops us below 10?
I also don't have a mental model of how the LSM shape evolves under experiment 2.1. We get mutually overlapping files in L0 that have "a little bit of everything" for each log. When we compact two of them, we are likely both growing and shrinking the result, since log truncations and log writes are mixed. Will the result go to Lbase+ either way or can things stay around in L0? If something makes it "out of L0", isn't it chased by a tombstone soon thereafter but does the tombstone even have a chance of catching up?
Other than the improvements you mentioned, will multi-level compactions help here? Or maybe even multi-level compactions that scan the source file(s) selectively, only picking up deletion tombstones. That way, they can "thin out" the lower levels without actually transporting more data into a lower level. That ties in with my earlier question about letting tombstones catch up with the values they delete. If tombstones were stored separately from the log entries, could we do this more efficiently? Somehow the problem is that we basically know that most entries are, or will soon be, deleted, but because everything gets jumbled up in L0 it becomes very hard to do anything clever.
Is a PebblesDB-type guard (only for L0) something to consider here, where we split the SSTs by replica? This would create lots of small SSTs, but maybe that's fine since we have no operations that scan across SSTs. If we then also split SSTs into values and deletions upon flush, we should be able to pick compactions that take a deletion input file (we can use rangedels for raft log truncation) and all "older" SSTs and compact them together, which likely means removing all SSTs (ok, now we're assuming a single rangedel, but otherwise the result is just an empty sst with a rangedel, also good). For example:
```
---------keys--------------------------->
    /| log 1       | log 2 | ... | log 50k |
L0 / | [3.put.sst] | ...
   \ | [2.del.sst] | ...
    \| [1.put.sst] | ...
----------------------------------------
L1 (no partitions)
   [sst with log entries that stuck around for a long time ]
```
Assuming all 50k "partitions" look like `log 1`'s, a compaction could pick, from each partition (combining enough partitions to create a large enough output file), the newest deletion SST and all of the prior SSTs, knowing that a "compaction" basically means throwing all of them away in the common case. There sure are a lot of files, but most of them are just that... files; it's not like we need to touch them often, and when we do, it will be on a per-log basis and with a strong bias towards the most recent files.
This all verges (probably deeply) into the "custom engine" territory (and has lots of problems I didn't think to consider), and I trust your intuition to put that off ~indefinitely. But it's immensely helpful for my intuitive and actual grasp of the topic to spitball these kinds of ideas.
If we then also split SSTs into values and deletions upon flush, we should be able to pick compactions that take a deletion input file (we can use rangedels for raft log truncation) and all "older" SSTs and compact them together, which likely means removing all SSTs (ok, now we're assuming a singlerangedel, but otherwise the result is just an empty sst with a rangedel, also good).
FYI, delete-only compactions are an optimization which Pebble already implements. Splitting sstables on replica boundaries could blow up the number of sstables, but perhaps that is worthwhile if doing so could take advantage of delete-only compactions such that the old sstables are simply removed without ever being read for a compaction.
I prototyped a table partitioning API a long time ago. This would give similar functionality to PebblesDB-type guards. Very little work to do there, just didn't have an immediate use for it so it was never pushed over the finish line.
I used https://github.com/cockroachdb/cockroach/issues/82109 to get some w-amp numbers as well. See the slack thread for more:
Currently, to support 4mb/s of uniformly random write throughput (400 ops/sec of 10kb random data), we need to write to disks at around ~150mb/s, and read at ~60mb/s (these numbers go up slowly as time passes; they were read off around 11h in). The LSM w-amp on these nodes is ~14.2. That means that 150/(4*14.2) = 2.6 is the additional w-amp factor introduced by the fact that we write everything twice: once to the raft log, and then once to the state machine. Isn't that surprising? I thought we would eat at most an additional factor of 2x, but hopefully a little less (since some raft log entries might be deleted while still in the memtable).
I am curious whether 2.6 is at all expected here. I might be holding something wrong.
I ran some alternative computation using the reported total disk bytes written vs the total sql bytes received (source) and this came out to a "real end to end" w-amp of 15.8 vs the 14.2 reported by pebble, so a factor of 1.1 and change. I need to think more about what this actually means but it strikes me as unexpectedly favorable. We hand everything to pebble twice (once on append and once on apply), so the factor being small seems to imply that the first write (append entry to WAL) mostly doesn't need to get flushed to L0, i.e. it incurs a w-amp much smaller than the apply write, which will always traverse the entire LSM (since unlike the raft log entries it doesn't get deleted).
We have 450 replicas here, which probably explains it. The memtable is 64mb, so it takes ~6400 writes to fill up the memtable. That's 6400/450 = 14.2 writes per replica. 14.2*10k is 142k which is larger than the 64kb threshold at which we start log truncations.
So it's plausible that in this workload, most raft entries are deleted in-place (though the deletions also add to the memtable, so we would have to divide the 6400 by half but then it still fits) before getting flushed.
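The arithmetic in the last two comments, restated in code (using binary units, so the round figures come out slightly differently, e.g. ~6553 writes per memtable fill rather than 6400):

```go
package main

import "fmt"

const (
	memtableSize = 64 << 20 // 64MB memtable
	valueSize    = 10 << 10 // ~10KB writes
	replicas     = 450
	truncThresh  = 64 << 10 // 64KB raft log size at which truncation starts
)

func main() {
	writesPerFill := memtableSize / valueSize // writes needed to fill the memtable
	perReplica := writesPerFill / replicas    // writes per replica per memtable fill
	bytesPerReplica := perReplica * valueSize // raft log bytes per replica per fill
	// bytesPerReplica exceeds truncThresh, so most entries can be truncated
	// in place before the memtable flushes.
	fmt.Println(writesPerFill, perReplica, bytesPerReplica > truncThresh)
}
```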
Nice experiments and writeup @sumeerbhola! The note about the "fsync side-effects" section in https://lwn.net/Articles/842385/ and then the experimental results to back it up are especially interesting.
In light of that article, one thing that I don't understand is why the syncs for the two WALs in the 2eng experiments had significantly different latency profiles. Also, if an fsync on one file has the effect of syncing all files, why was `WALBytesPerSync` needed for the state machine WAL but not for the raft log WAL? Wouldn't we expect the frequent syncing of the raft log WAL to obviate the need for `WALBytesPerSync`?
Another experiment that used SingleDel to truncate the log in all these cases showed no improvement.
Does this make sense, given that we don't see the raft engine's LSM expand beyond a single level? Or did we see the raft engine's LSM grow to multiple levels and still not see an improvement using SingleDel?
Either way, it's worth mentioning those results in https://github.com/cockroachdb/cockroach/issues/8979.
In light of that article, one thing that I don't understand is why the syncs for the two WALs in the 2eng experiments had significantly different latency profiles.
I don't have an answer. I may be interpreting that article incorrectly -- maybe the "sync everything" behavior is limited to the journal. There are old threads like this one https://groups.google.com/g/fa.linux.kernel/c/ogRJPoaso9g/m/ysmSK69ylS8J, which also seem related but I can't quite make sense of them.
Can't we test whether the "sync everything" behavior is actually taking place, and under what circumstances? Create two files of dirty data and call fsync and fdatasync on only one of them, using Linux I/O profiling tools to see what is being written out at the block layer. Note that I'd want to see if fsync differs from fdatasync. Also worth testing sync_file_range (which is what we use for `WALBytesPerSync`).
cc @cockroachdb/replication
@andrewbaptist suggested removing the fsync interference caused by sharing the filesystem by instead having 2 filesystems on the same block device. There are two options for this filesystem split:
A slightly different alternative is to not use fsync on either WAL and instead write them using AIO/O_DIRECT. If we go down that path, we would want to preallocate the WAL and then treat it as a circular buffer. Using the O_DIRECT path would mean we would need to be more explicit about our writes (need to be 4K aligned), but if we do that, then we also get some of the benefits of having an async interface to the storage layer since only the WAL write is in the critical path. This could also be done with io_uring, but I don't know a lot about that...
The Pebble WAL is not pre-allocated (well it is, but not in the way that you're describing), but the WAL recycling functionality means it is more or less a circular buffer when operating in the steady state. We don't seem to have a good description of WAL recycling in a comment or design doc, but @jbowens or @sumeerbhola know where the bodies are buried (i.e. how this all works).
Umbrella issue for #16361.
Jira issue: CRDB-12188
Epic CRDB-220