
kvserver: separate raft log #16624

Open irfansharif opened 7 years ago

irfansharif commented 7 years ago

Umbrella issue for #16361.

Jira issue: CRDB-12188

Epic CRDB-220

irfansharif commented 7 years ago
~  env COCKROACH_DEDICATED_RAFT_STORAGE=DISABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10"  > perf-disabled
~  env COCKROACH_DEDICATED_RAFT_STORAGE=ENABLED make bench PKG=./pkg/storage BENCHES="BenchmarkReplicaRaftStorage" TESTFLAGS="-count 10"  > perf-enabled
~  benchstat perf-enabled perf-disabled

  name                             old time/op  new time/op  delta
  ReplicaRaftStorage/vs=1024-4     1.70ms ± 4%  0.44ms ±12%  -73.90%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=4096-4     1.91ms ± 2%  0.63ms ±11%  -66.95%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=16384-4    2.74ms ± 2%  1.46ms ± 4%  -46.68%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=65536-4    5.16ms ± 5%  4.38ms ± 3%  -15.10%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=262144-4   13.6ms ± 4%  16.4ms ± 3%  +20.51%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=1048576-4  50.9ms ± 3%  71.0ms ± 4%  +39.70%  (p=0.000 n=10+8)

Early, untuned & possibly incorrect benchmarks. Will update comment once verified.

petermattis commented 7 years ago

Any insight as to why performance gets worse with a dedicated Raft storage engine at larger value sizes?

irfansharif commented 7 years ago

Not yet, no. It shouldn't happen as far as I can tell, and it doesn't align with the early experiments done in #16361 either. Investigating.

irfansharif commented 7 years ago

There was a particularly ill-suited reflect.DeepEqual debug assertion here skewing things. That aside, we have:

~ benchstat perf-disabled perf-enabled

  name                             old time/op  new time/op  delta
  ReplicaRaftStorage/vs=1024-4      320µs ± 6%   385µs ±18%  +20.29%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=4096-4      613µs ±14%   580µs ± 2%     ~     (p=0.278 n=10+9)
  ReplicaRaftStorage/vs=16384-4    2.59ms ± 3%  2.05ms ± 4%  -20.87%  (p=0.000 n=10+9)
  ReplicaRaftStorage/vs=65536-4    4.11ms ± 7%  3.29ms ± 3%  -19.97%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=262144-4   13.4ms ± 8%  10.7ms ± 3%  -20.39%  (p=0.000 n=10+10)
  ReplicaRaftStorage/vs=1048576-4  56.8ms ± 3%  36.4ms ± 2%  -35.91%  (p=0.000 n=10+10)
tbg commented 7 years ago

This one? https://github.com/cockroachdb/cockroach/issues/16591


irfansharif commented 7 years ago

Nope, I had sprinkled them all over when implementing this to guarantee parity across the two instances; I just forgot to remove them all when posting the first benchmark.

petermattis commented 7 years ago

Have you experimented with disabling syncing when applying Raft log truncations? That would be the ideal scenario with regards to syncing of the non-Raft storage engine.


irfansharif commented 7 years ago

The following are some results from experiments to determine the relative lack of speedups from #16809 (pardon the verbosity; it's primarily self-documentation). I ran two long-running experiments on navy, ycsb -workload F (pure writes), with the same initial load: the first run (with the new changes) from 7/13 04:00 (grafana time) to 7/13 21:11, the second from 7/13 21:27 to 7/14 14:39.

First, the results:

--- Before
17h21m56s       1145.0           0 / 0 / 0        1145 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  62516.6s          919.5              0

--- After
17h49m5s       1099.0           0 / 0 / 0        1099 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  64145.9s          896.2              0

Overall a 2.59% decrease in throughput. I did the same for a shorter run:

--- Before
1h12m25s       1084.9           0 / 0 / 0        1085 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  4345.3s          900.9              0

--- After
1h18m40s       1060.0           0 / 0 / 0        1060 / 0           0 / 0
elapsed__ops/sec(total)__errors(total)
  4720.8s          877.0              0

Similarly, 2.62% decrease in throughput.

Here are some interesting graphs from the longer run, the first half is navy with the dedicated raft engine changes and the second without: image

Overall the change, as is, seems to increase Raft command commit latency. Picking navy-006 to remove the outlier: image

We have more CGo calls than we did before (this is expected: we're separately committing to a whole other engine without any form of batching across the two): image

We have slightly longer raft log processing times: image

I tried using go-torch after, here are some flame graphs operating on the same data:

Before: image

After: image

They both mostly have the same structure (good, expected), except that the "after" profile has a much wider second flame (I don't know what else to call it). Looking into this further using callgrind (screenshots not shown), I was able to confirm a larger proportion of time spent in CGo land, happening right about here:

         .          .   4540:   isLogTruncationRequest := rResult.RaftLogDelta != nil
         .      3.62s   4541:   if err := batch.Commit(isLogTruncationRequest); err != nil {
         .          .   4542:           return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
         .          .   4543:                   errors.Wrap(err, "could not commit batch")))
         .          .   4544:   }
         .      700ms   4545:   if err := raftBatch.Commit(isLogTruncationRequest); err != nil {
         .          .   4546:           return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
         .          .   4547:                   errors.Wrap(err, "could not commit raft batch")))
         .          .   4548:   }
      10ms       10ms   4549:   elapsed := timeutil.Since(start)
         .       30ms   4550:   r.store.metrics.RaftCommandCommitLatency.RecordValue(elapsed.Nanoseconds())
         .       10ms   4551:   return rResult.Delta, nil
         .          .   4552:}
         .          .   4553:

I tried disabling raft log syncing (for both engines, before and after these changes, to get a fair comparison) but still saw similar results. What I think is happening comes down to our current batching process for RocksDB batch commits: batching is specific to an individual engine, and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This two-phased batching process might explain the drop in throughput. Compare this to a better batching strategy that batches all writes going down to RocksDB, regardless of which engine they address. I'm also looking into what else can be shared across the two instances; by default the total number of compaction threads is shared, but there's probably more (see include/rocksdb/env.h).

irfansharif commented 7 years ago

Some other asides:

irfansharif commented 7 years ago

+cc @tschottdorf.

petermattis commented 7 years ago

I tried disabling raft log syncing (for both engines, before and after these changes to get a fair comparison) but still saw similar results. As for what I think is happening is our current batching process for RocksDB batch commits. They are specific to individual engines and now that we have two, we batch all writes for the first engine and then separately do the same for the second. This creates this two-phased batching process which might explain the drop in throughput. Compare this to a better batching strategy batching all writes going down to RocksDB, regardless of the engine addressed.

Yeah, that does seem problematic. When adding the dedicated syncing goroutine, I considered adding a dedicated goroutine to perform the commits. With some restructuring, I think we could have a dedicated per-engine goroutine for committing batches, queue both batches and then wait for both of the commits to happen.
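
A rough sketch of that shape, under hypothetical names (this is not CockroachDB's actual code): one commit goroutine per engine, with the caller enqueueing a batch on each committer and waiting on both results so the two commits overlap instead of running back to back.

package commitsketch

// Batch is a stand-in for the storage engine's batch type; only Commit is
// assumed here.
type Batch interface {
	Commit(sync bool) error
}

type commitReq struct {
	batch Batch
	sync  bool
	done  chan error
}

// engineCommitter owns one engine's commit pipeline: a dedicated goroutine
// drains the queue, which also gives the engine a natural point at which to
// coalesce queued batches.
type engineCommitter struct {
	reqs chan commitReq
}

func newEngineCommitter() *engineCommitter {
	ec := &engineCommitter{reqs: make(chan commitReq, 128)}
	go func() {
		for req := range ec.reqs {
			req.done <- req.batch.Commit(req.sync)
		}
	}()
	return ec
}

func (ec *engineCommitter) commitAsync(b Batch, sync bool) <-chan error {
	done := make(chan error, 1)
	ec.reqs <- commitReq{batch: b, sync: sync, done: done}
	return done
}

// commitBoth queues a batch on each engine's committer and then waits for
// both, so the two commits proceed in parallel rather than back to back.
func commitBoth(base, raft *engineCommitter, baseBatch, raftBatch Batch, sync bool) error {
	baseDone := base.commitAsync(baseBatch, sync)
	raftDone := raft.commitAsync(raftBatch, sync)
	if err := <-baseDone; err != nil {
		return err
	}
	return <-raftDone
}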

irfansharif commented 7 years ago

Rebasing on top of new master, I'm seeing 0 QPS quite frequently; this is ycsb -workload F on ultramarine:

  9m17s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m18s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m19s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m20s          0.0           0 / 0 / 0           0 / 0           0 / 0
elapsed______ops/sec__reads/empty/errors___writes/errors____scans/errors
  9m21s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m22s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m23s        116.0           0 / 0 / 0         116 / 0           0 / 0
  9m24s        499.0           0 / 0 / 0         499 / 0           0 / 0
  9m25s        420.0           0 / 0 / 0         420 / 0           0 / 0
  9m26s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m27s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m28s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m29s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m30s        405.0           0 / 0 / 0         405 / 0           0 / 0
  9m31s         59.0           0 / 0 / 0          59 / 0           0 / 0
  9m32s          0.0           0 / 0 / 0           0 / 0           0 / 0
  9m33s          0.0           0 / 0 / 0           0 / 0           0 / 0
   ...
elapsed__ops/sec(total)__errors(total)
 773.4s          155.4              0

This is new post-rebase; I suspect it's due to #16942 and the reworked batching mechanisms. I'm isolating a build without those changes.

For posterity, these are the numbers without the raft log changes:

elapsed______ops/sec__reads/empty/errors___writes/errors____scans/errors
   6m1s        558.0           0 / 0 / 0         558 / 0           0 / 0
   6m2s        541.0           0 / 0 / 0         541 / 0           0 / 0
   6m3s        596.0           0 / 0 / 0         596 / 0           0 / 0
   6m4s        553.0           0 / 0 / 0         553 / 0           0 / 0
   6m5s        629.0           0 / 0 / 0         629 / 0           0 / 0
   6m6s        538.0           0 / 0 / 0         538 / 0           0 / 0
   6m7s        566.0           0 / 0 / 0         566 / 0           0 / 0
   6m8s        597.0           0 / 0 / 0         597 / 0           0 / 0
 ... 
elapsed__ops/sec(total)__errors(total)
 368.0s          457.6              0

No zero QPS, and almost three times as fast (!).

irfansharif commented 7 years ago

Nope, something else going on here. Here's a build reverting three related commits to the new batching/syncing mechanism:

a86bba2f6 Revert "storage/engine: use SyncWAL instead of sync on commit"
f9379de0e Revert "storage/engine: unify code to commit write-only and read-write batches"
ec8bd15c0 Revert "storage/engine: lazily initialize the RocksDB batch"
   4m1s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m2s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m3s        244.0           0 / 0 / 0         244 / 0           0 / 0
   4m4s         12.0           0 / 0 / 0          12 / 0           0 / 0
   4m5s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m6s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m7s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m8s          0.0           0 / 0 / 0           0 / 0           0 / 0
   4m9s          0.0           0 / 0 / 0           0 / 0           0 / 0
  4m10s          0.0           0 / 0 / 0           0 / 0           0 / 0
  4m11s        280.0           0 / 0 / 0         280 / 0           0 / 0
 ... 
elapsed__ops/sec(total)__errors(total)
 251.3s          245.9              0
irfansharif commented 7 years ago

The zero QPS points occur during long stretches of completely idle CPU time: image

petermattis commented 7 years ago

0 QPS like that often indicates a problem with leases.

irfansharif commented 7 years ago

Circling back to this issue: it's still visible after restructuring so that truncations move downstream and happen in their own dedicated queue after the fact ("truncations" only update TruncatedState; the actual deletion of log entries happens afterwards). I'm sure there's a very simple oversight causing this, but I'm still investigating. Below are some graphs from a longer run last night where we observe gradual degradation to the 0 QPS point. My initial theory is some funky interaction between the existing raft log queue (queueing truncations) and the snapshot queue (queueing snapshots; on initial investigation, however, the requests seem to originate out of raft directly). We see a continuous cycle of snapshot generation + application for the same small subset of ranges (this should have been mitigated by the proposal quota pool; I've come full circle). The workload here is ycsb -workload F -splits 1000 -concurrency 1000. Forgive the verbosity/noise, I don't know what's relevant yet.

Slow degradation in throughput followed by intermittent peaks: image

Enqueued replicas despite 0 throughput, also a lot of failures: image image

Surge of non-preemptive snapshots generated (master doesn't show this for the same workload): image image

Raft log self behind creeps up: image

Replicas turn quiescent: image

Other side effects, raft is blocked as well: image

irfansharif commented 7 years ago

Seeing a lot of snapshot generation/application isolated to a few specific ranges:

I170725 03:58:16.971243 1078385 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8570 (id=1d594482, encoded size=8836147, 34 rocksdb batches, 24 log entries)
I170725 03:58:17.112937 1078385 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 142ms [clear=0ms batch=99ms entries=0ms commit=42ms]
I170725 03:58:18.162192 1083635 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8619 (id=194c1442, encoded size=8895367, 34 rocksdb batches, 30 log entries)
I170725 03:58:18.315653 1083635 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 153ms [clear=5ms batch=57ms entries=1ms commit=90ms]
I170725 03:58:19.455216 1088924 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8665 (id=9afacbdb, encoded size=8941200, 34 rocksdb batches, 29 log entries)
I170725 03:58:19.535733 1088924 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 80ms [clear=0ms batch=36ms entries=0ms commit=44ms]
I170725 03:58:20.617621 1094108 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8703 (id=07eca121, encoded size=8972789, 35 rocksdb batches, 24 log entries)
I170725 03:58:20.861394 1094108 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 244ms [clear=0ms batch=174ms entries=1ms commit=69ms]
I170725 03:58:21.939922 1099239 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8748 (id=59467f09, encoded size=9017571, 35 rocksdb batches, 23 log entries)
I170725 03:58:22.004040 1099239 storage/replica_raftstorage.go:777  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applied Raft snapshot in 64ms [clear=0ms batch=8ms entries=0ms commit=55ms]
I170725 03:58:23.110255 1104397 storage/replica_raftstorage.go:769  [n1,s1,r492/1:/Table/51/1/{691283…-701234…}] applying Raft snapshot at index 8802 (id=f619d9ce, encoded size=9087894, 35 rocksdb batches, 33 log entries)
bdarnell commented 7 years ago

I'm not sure what it means yet, but I think that last graph is going to be the key: we have periodic bursts of messages getting through, coinciding with an increase in the level of dropped messages (and each time the spikes/increases get smaller). It looks like it repeats at regular intervals, but not at a round number (11 minutes?)

This suggests problems in the raft transport, maybe flow control related. I'd try increasing initialConnWindowSize in rpc/context.go.
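
For context, initialConnWindowSize feeds gRPC's per-connection flow-control window. Below is a minimal sketch of the knob in question, assuming the grpc-go window-size dial options; the constant names mirror the ones in rpc/context.go mentioned above, and the values are illustrative rather than CockroachDB's actual settings.

package main

import (
	"log"

	"google.golang.org/grpc"
)

// Illustrative values only; the real constants live in pkg/rpc/context.go.
const (
	initialWindowSize     = 1 << 16                // per-stream flow-control window
	initialConnWindowSize = initialWindowSize * 16 // per-connection window, the knob suggested above
)

func dialRaftTransport(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(), // keeps the sketch self-contained; real code uses TLS credentials
		grpc.WithInitialWindowSize(initialWindowSize),
		grpc.WithInitialConnWindowSize(initialConnWindowSize),
	)
}

func main() {
	conn, err := dialRaftTransport("localhost:26257")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}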

irfansharif commented 7 years ago

I'm sure there's a very simple oversight causing this but still investigating

Found the issue: it was due to Replica.mu.raftLogSize not getting updated correctly despite actual truncations taking place. This happened after a live "migration" over to the new engine, when upstream of raft the "simulated truncation" (https://github.com/cockroachdb/cockroach/blob/7ca57b6cce19339e811fafc0421b90b978f098e0/pkg/storage/replica_command.go#L1927) did not play well with "actual" truncations happening downstream. The result was that despite "actual" truncations happening (log entries got deleted in RocksDB), the raft log queue still accrued replicas as each Replica.mu.raftLogSize grew indefinitely. All of the following emergent side effects could be traced down to just that.

The graphs look steady following a 10hr run last night, albeit with reduced throughput.

irfansharif commented 7 years ago

Here are the new degradations (the implementation is as yet incomplete; it's in an awkward state from debugging the previous issue). Things to look out for:

We still have more replicas queued for log truncations (I have a handle on why and am addressing it): image

Despite this we seem to be processing fewer replica log truncations: image

Side effect perhaps, raft throughput takes a hit: image

Most pressing, raft log commit latencies shoot way up. To a lesser degree the same is true for raft command commit latencies. For this run, log truncations, though processed downstream, happen just as frequently. The Raft RocksDB instance is untuned (same defaults as the base engine). Investigating. image

irfansharif commented 7 years ago

Another stab at this, w/ the dedicated raft log GC queue "actually" deleting after the fact. The live migration takes place around 16:52; we maintain steady performance for a while before eventually hitting 0 QPS again. Similar symptoms as before, and at first glance similar causes (over-aggressive raft log truncations).

Dropping throughput, eventually zero, starting at 16:52: image

Uptick in replicas queued in the raft log queue, again, starting at 16:52: image

Again we eventually get to cyclic snapshot generation: image image

Right around then more replicas report themselves to be "behind" (hence the snapshots): image

Replicas eventually quiesce, we're now at 0 QPS: image

Raft log commit latencies shoot up, raft command commits too but not so much. Consider that the new engine is as yet untuned: image


As an aside, here are some new metrics we're tracking now. They will aid when tuning:

image

image

image

image

petermattis commented 7 years ago

Heh, I've used the term "snapshot death spiral" for this behavior. Similar problems occurred in the past when Raft log truncation was truncating while we were trying to catch up or rebalance a replica. The heuristics now in raft_log_queue.go are supposed to avoid this, but perhaps something in this PR has broken those heuristics.

irfansharif commented 7 years ago

After fixing the zero QPS issue (more raft log truncation simulation badness, sigh), I'm running into OOM/high RocksDB memory usage behavior.

image

The first two steep runs are running truncations in the separate raft log GC queue using tombstones, and fairly infrequently at that. Disk usage for the raft engine just grows unbounded until we hit 47 GiB, at which point we are killed. The long-running stable run is our disk usage without the raft engine enabled, where we stay steady under the same constant load (ycsb -workload F -splits 500 -concurrency 500). The differences between the three runs where we max out disk space:

The blip towards the end of the first run is what happens when the first node is killed after running out of memory; queries stall temporarily, presumably allowing the other nodes to run compactions and lower disk usage.

Here are some compactions stats right before this happens on n2,s2:

I170802 14:47:44.643162 92 storage/store.go:4386  [n2,s2] sstables (raft eng read amplification = 3):
0 [ 38M 1 ]: 38M
5 [  4M 1 ]: 4M
6 [ 79M 2 ]: 77M 1M
I170802 14:47:44.643226 92 storage/store.go:4387  [n2,s2]
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0   37.67 MB   0.9      0.0     0.0      0.0       3.7      3.7       0.0   1.0      0.0     49.6        76        99    0.770       0      0
  L5      1/0    3.60 MB   0.5      3.5     3.5      0.0       2.8      2.8       0.0   0.8     46.4     37.2        78        47    1.654   6896K   686K
  L6      2/0   78.83 MB   0.0      6.5     3.0      3.5       3.6      0.1       0.0   1.2     62.5     34.6       106        63    1.678   9644K  6277K
 Sum      4/0   120.09 MB   0.0     10.0     6.5      3.5      10.1      6.6       0.0   2.7     39.3     39.8       260       209    1.243     16M  6963K
 Int      0/0    0.00 KB   0.0      1.3     0.8      0.5       1.3      0.9       0.0   2.7     28.4     29.4        46        25    1.837   2117K   849K
Uptime(secs): 4810.2 total, 4810.2 interval
Flush(GB): cumulative 3.693, interval 0.491
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 10.09 GB write, 2.15 MB/s write, 9.98 GB read, 2.12 MB/s read, 259.7 seconds
Interval compaction: 1.32 GB write, 0.28 MB/s write, 1.27 GB read, 0.27 MB/s read, 45.9 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

Here are some after:

I170802 15:07:44.645798 92 storage/store.go:4386  [n2,s2] sstables (raft eng read amplification = 1):
6 [ 7M 1 ]: 7M
I170802 15:07:44.645980 92 storage/store.go:4387  [n2,s2]
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       3.8      3.8       0.0   1.0      0.0     49.5        78       102    0.766       0      0
  L5      0/0    0.00 KB   0.0      3.6     3.6      0.0       2.9      2.9       0.0   0.8     47.0     37.3        80        49    1.623   7267K   743K
  L6      1/0    7.23 MB   0.0      6.7     3.0      3.6       3.7      0.0       0.0   1.2     61.2     33.5       112        65    1.718     10M  6628K
 Sum      1/0    7.23 MB   0.0     10.3     6.7      3.7      10.3      6.7       0.0   2.7     39.2     39.3       269       216    1.247     17M  7371K
 Int      0/0    0.00 KB   0.0      0.1     0.1      0.1       0.0     -0.1       0.0   3.2    171.1     28.6         1         3    0.264    436K   263K
Uptime(secs): 6010.2 total, 6010.2 interval
Flush(GB): cumulative 3.776, interval 0.007
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 10.33 GB write, 1.76 MB/s write, 10.32 GB read, 1.76 MB/s read, 269.3 seconds
Interval compaction: 0.02 GB write, 0.00 MB/s write, 0.13 GB read, 0.02 MB/s read, 0.8 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

Disk size usage before (raft data is only stored in the /raft subdir, the delta is the disk space used by the single base engine):

~ crl-ssh ultramarine all sudo du -sh /mnt/data1/raft
1: cockroach@cockroach-ultramarine-0001.crdb.io
11G     /mnt/data1/raft

2: cockroach@cockroach-ultramarine-0002.crdb.io
11G     /mnt/data1/raft

3: cockroach@cockroach-ultramarine-0003.crdb.io
12G     /mnt/data1/raft

4: cockroach@cockroach-ultramarine-0004.crdb.io
10G     /mnt/data1/raft

~ crl-ssh ultramarine all sudo du -sh /mnt/data1
1: cockroach@cockroach-ultramarine-0001.crdb.io
16G     /mnt/data1

2: cockroach@cockroach-ultramarine-0002.crdb.io
15G     /mnt/data1

3: cockroach@cockroach-ultramarine-0003.crdb.io
16G     /mnt/data1

4: cockroach@cockroach-ultramarine-0004.crdb.io
15G     /mnt/data1

Disk usage after, n1 was killed and restarted. Note the disk usage for the other nodes dropped in the interim.

➜  server git:(dedicated-raft-storage) crl-ssh ultramarine all sudo du -sh /mnt/data1/raft
1: cockroach@cockroach-ultramarine-0001.crdb.io
80M     /mnt/data1/raft

2: cockroach@cockroach-ultramarine-0002.crdb.io
11G     /mnt/data1/raft

3: cockroach@cockroach-ultramarine-0003.crdb.io
12G     /mnt/data1/raft

4: cockroach@cockroach-ultramarine-0004.crdb.io
11G     /mnt/data1/raft

➜  server git:(dedicated-raft-storage) crl-ssh ultramarine all sudo du -sh /mnt/data1
1: cockroach@cockroach-ultramarine-0001.crdb.io
4.8G    /mnt/data1

2: cockroach@cockroach-ultramarine-0002.crdb.io
17G     /mnt/data1

3: cockroach@cockroach-ultramarine-0003.crdb.io
18G     /mnt/data1

4: cockroach@cockroach-ultramarine-0004.crdb.io
17G     /mnt/data1

Before looking into this further, is there something currently in place to bound disk usage for the base engine? I'm surprised it grows so rapidly, having simply moved raft log entries + truncations to a separate engine and reduced the frequency with which they occur. The raft engines for all three runs are using the same config/tunings as the base engine.

petermattis commented 7 years ago

As mentioned offline, there is no limiter on the amount of disk space RocksDB uses. The significant disk growth is very surprising. I wonder if RocksDB isn't compacting sstables in the Raft log engine for some reason. Nothing is jumping out at me in the compaction stats. Do you have compaction stats from the Raft log engine when its disk usage grew to 47GB?

irfansharif commented 7 years ago

The snippets above are the stats from the raft log engine. Investigating.

petermattis commented 7 years ago

The snippets above are the stats from the raft log engine. Investigating.

Huh? The total data size in sstables there is tiny. I guess that makes sense on the one hand, we don't expect the Raft log engine to take much space on disk. So where is the 47GB number coming from?

irfansharif commented 7 years ago

Strange, the compaction stats don't match up with the files on disk.

I170802 17:39:07.336853 10 storage/store.go:4386  [n1,s1] sstables (raft eng read amplification = 1):
6 [ 2M 1 ]: 2M
I170802 17:39:07.336915 10 storage/store.go:4387  [n1,s1]
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     23.7         1        15    0.087       0      0
  L6      1/0    1.82 MB   0.0      0.1     0.0      0.0       0.0     -0.0       0.0   0.9     81.5     37.7         1        14    0.047    613K   588K
 Sum      1/0    1.82 MB   0.0      0.1     0.0      0.0       0.1      0.0       0.0   1.8     27.5     28.4         2        29    0.068    613K   588K
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0

Where I'm supposed to have a single 2MB L6 file, instead I have:

cockroach@cockroach-ultramarine-0001:/mnt/data1/raft$ du -sh *
2.0M    000007.sst
2.4M    000011.sst
2.0M    000014.sst
1.8M    000017.sst
1.7M    000020.sst
1.2M    000023.sst
2.0M    000026.sst
1.5M    000029.sst
2.1M    000032.sst
1.5M    000035.sst
2.1M    000038.sst
2.1M    000041.sst
2.2M    000044.sst
1.4M    000047.sst
55M     000048.log
1.9M    000050.sst
4.0K    auxiliary
4.0K    COCKROACHDB_VERSION
4.0K    CURRENT
4.0K    IDENTITY
0       LOCK
4.0K    MANIFEST-000008
8.0K    OPTIONS-000005
petermattis commented 7 years ago

That's very strange. I'd attempt to reproduce this on a local single-node cluster and try to get to the bottom of it.

irfansharif commented 7 years ago

The issue above was a resource leak: keeping snapshots open leaves SSTs on disk, but they don't show up in the compaction stats. TIL.


I put together another change here, obvious in hindsight, where we can get away with flushing the base engine's WAL once for all replicas (as opposed to flushing each time we delete a replica's raft log entries) and deleting entries up to that compaction point in a dedicated scanner + queue. The way you'd do this is a first "mark" pass, marking the low watermark for log entry deletions for all replicas, followed by a single base engine flush. After this we simply delete up to the marked point for each replica. See #16361 for why this is even necessary. Right now this deletion happens sequentially, though it need not be: it could be batched across all ranges to avoid multiple (fast, since we use range deletion tombstones) CGo calls. Even without doing that, there are measurable perf gains.
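
A rough sketch of the mark-then-flush-then-delete idea, with hypothetical types and helper names (not the actual change):

package raftloggc

// Replica and Engine are stand-ins; TruncatableIndex, FlushWAL and the
// deleteEntriesUpTo callback are hypothetical names for the real calls.
type Replica interface {
	// TruncatableIndex is the low watermark up to which this replica's raft
	// log entries are no longer needed.
	TruncatableIndex() uint64
}

type Engine interface {
	FlushWAL() error
}

// gcRaftLogs sketches the "mark, flush once, delete" idea: record each
// replica's truncation point, flush the base engine a single time so the
// applied state backing those points is durable, and only then delete the
// raft log entries (e.g. via range deletion tombstones).
func gcRaftLogs(
	base Engine, replicas []Replica, deleteEntriesUpTo func(r Replica, index uint64) error,
) error {
	// Mark phase: snapshot the truncation point for every replica.
	marks := make([]uint64, len(replicas))
	for i, r := range replicas {
		marks[i] = r.TruncatableIndex()
	}

	// One flush for all replicas, instead of one per replica truncation.
	if err := base.FlushWAL(); err != nil {
		return err
	}

	// Delete phase: entries at or below the marked index are now safe to drop.
	// This runs sequentially here; it could be batched across ranges to save
	// CGo calls, as noted above.
	for i, r := range replicas {
		if err := deleteEntriesUpTo(r, marks[i]); err != nil {
			return err
		}
	}
	return nil
}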

Before getting into some of the results: the original motivation to decouple the two engines was the unnecessary syncing of user-level keys when syncing raft log data (described in more detail in #16361). What this change intends to do is reduce RocksDB commit latencies; synced writes would additionally flush previously unsynced writes. If we maintain two engines, however, we can ensure synced writes only flush what absolutely needs to be flushed.

As was observed in #16942 (and repeated below), at the time of writing our 99th percentile raft command commit latencies were ~30-40ms. It was observed that most of the time unsynced writes were waiting around for an earlier WAL sync. By introducing a dedicated RocksDB SyncWAL goroutine and slightly changing the semantics of synced writes (add batch to WAL, add entries to memtable, then sync WAL, instead of add batch to WAL, sync WAL, and finally add entries to memtable), we were able to avoid having unsynced writes wait around for synced writes. This brought raft command commit latencies down from ~30-40ms to ~6-10ms.
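
For reference, a minimal sketch of the coalesced-sync idea from #16942 (hypothetical names, not the actual implementation): writes commit to the WAL and memtable unsynced, and batches that need durability then wait on a dedicated goroutine that issues one SyncWAL call for everything queued since the last sync.

package walsync

// DB is a stand-in for the engine; only SyncWAL is assumed here.
type DB interface {
	SyncWAL() error
}

// waiter represents one committed-but-unsynced batch waiting for durability.
type waiter chan error

// startSyncer runs the dedicated WAL-sync goroutine. Each iteration drains
// every waiter queued since the last sync and satisfies all of them with a
// single SyncWAL call, so syncs are coalesced and unsynced commits (which
// never enqueue here) no longer wait behind someone else's fsync.
func startSyncer(db DB) chan<- waiter {
	reqs := make(chan waiter, 1024)
	go func() {
		for w := range reqs {
			batch := []waiter{w}
		drain:
			for {
				select {
				case more := <-reqs:
					batch = append(batch, more)
				default:
					break drain
				}
			}
			err := db.SyncWAL()
			for _, waiting := range batch {
				waiting <- err
			}
		}
	}()
	return reqs
}

// commitSynced sketches the new write order: commit the batch to the WAL and
// memtable first (unsynced), then wait for the coalesced WAL sync.
func commitSynced(commitUnsynced func() error, syncReqs chan<- waiter) error {
	if err := commitUnsynced(); err != nil {
		return err
	}
	w := make(waiter, 1)
	syncReqs <- w
	return <-w
}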

Putting aside the raft command commit latency reductions via the implementation in #16942, given what we know about unsynced commands waiting for earlier synced commands flushing the WAL, if we use two separate engines with the base engine never flushing (except for the batched log truncations, which we can ignore given how infrequent they are), we should similarly see a reduction in raft command commit latencies and an increase in throughput. This is in fact what we observe.

Two runs, 1hr long on ultramarine running ycsb -workload F -splits 500 -concurrency 500. Before (without the SyncWAL changes and before introducing the dedicated raft engine): image

After (without the SyncWAL changes and after introducing the dedicated raft engine): image

I ran the same with SyncWAL changes, verifying the 15% boost alluded to in #16942. Here's a table with all the many combinations (undedicated is using the single base engine for all keys including raft data).

            | pre-syncWAL  |  syncWAL
------------------------------------------
undedicated | 2063.8 ops/s |  2358.0 ops/s (+14.26%)
------------------------------------------
dedicated   | 2215.1 ops/s |  2383.5 ops/s (+7.60%)
              (+7.33%)         (+1.08%)

As for the difference between +7.33% going from (pre-syncWAL,undedicated) to (pre-syncWAL,dedicated) and +14.26% going from (pre-syncWAL,undedicated) to (syncWAL,undedicated), I'm not completely sure. We seem to get an increase in throughput for similar reasons, but not quite to the same extent. The two runs can be seen here (+7.33%) & here (+14.26%).

The only thing that stood out to me is that the 99th percentile raft log commit time jumps up a bit (node 1 is the node ycsb is running on).

Before (without the SyncWAL changes and before introducing the dedicated raft engine): image

After (without the SyncWAL changes and after introducing the dedicated raft engine): image

My guess is that this is because the new engine, now that it's batching only the batches addressed to itself, is collecting more "synchronous" batches. Previously, raft log commits got batched with raft command commits, so the latency evened out. Now that batching is isolated to batches with raft data addressed to the same engine, we might just be accreting more "synchronous" batches, thereby contributing to slightly higher 99th percentile raft log commit latencies.

Now, after these changes there are certain existing tunings that are suboptimal with the new restructuring of raft log truncations. Consider that "truncation requests" going through the system don't actually truncate the log, but simply update TruncatedState; the actual deletions are deferred, happening after the fact in the dedicated scanner + queue. One simple tuning I did to test this was lowering the frequency of raft log truncations by increasing RaftLogQueueTimerDuration and RaftLogQueueStaleThreshold:

diff --git i/pkg/storage/raft_log_queue.go w/pkg/storage/raft_log_queue.go
index b43a11841..6adb2c695 100644
--- i/pkg/storage/raft_log_queue.go
+++ w/pkg/storage/raft_log_queue.go
@@ -43,11 +43,11 @@ const (
        // RaftLogQueueTimerDuration is the duration between truncations. This needs
        // to be relatively short so that truncations can keep up with raft log entry
        // creation.
-       RaftLogQueueTimerDuration = 50 * time.Millisecond
+       RaftLogQueueTimerDuration = 100 * time.Millisecond
        // RaftLogQueueStaleThreshold is the minimum threshold for stale raft log
        // entries. A stale entry is one which all replicas of the range have
        // progressed past and thus is no longer needed and can be truncated.
-       RaftLogQueueStaleThreshold = 100
+       RaftLogQueueStaleThreshold = 150
        // RaftLogQueueStaleSize is the minimum size of the Raft log that we'll
        // truncate even if there are fewer than RaftLogQueueStaleThreshold entries
        // to truncate. The value of 64 KB was chosen experimentally by looking at

Another change I made was adjusting the scheduler concurrency (the diff below lowers it from 12x to 8x NumCPU); this was motivated by trying to induce more batching at each of the two RocksDB commits (raft log and raft command).

diff --git i/pkg/storage/store.go w/pkg/storage/store.go
index a458d338b..4ff6bae28 100644
--- i/pkg/storage/store.go
+++ w/pkg/storage/store.go
@@ -123,7 +123,7 @@ var defaultRaftElectionTimeoutTicks = envutil.EnvOrDefaultInt(
        "COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS", 15)

 var storeSchedulerConcurrency = envutil.EnvOrDefaultInt(
-       "COCKROACH_SCHEDULER_CONCURRENCY", 12*runtime.NumCPU())
+       "COCKROACH_SCHEDULER_CONCURRENCY", 8*runtime.NumCPU())

 var enablePreVote = envutil.EnvOrDefaultBool(

I ran another set of four runs, similarly 1h long and similar permutations as above, and measured the perf gains.

            | pre-syncWAL  |  syncWAL
------------------------------------------
undedicated | 2252.6 ops/s |  2390.7 ops/s (+6.13%)
------------------------------------------
dedicated   | 2345.1 ops/s |  2472.2 ops/s (+5.42%)
              (+4.11%)        (+3.40%)

The difference between going from (pre-syncWAL,undedicated) to (syncWAL,undedicated) and (pre-syncWAL,undedicated) to (pre-syncWAL,dedicated) is less pronounced.

Overall the tunings help (+cc #17165).

          | undedicated,pre | undedicated,post | dedicated,pre  | dedicated,post
---------------------------------------------------------------------------------------
untuned   | 2063.8 ops/s    | 2358.0 ops/sec   | 2215.1 ops/sec | 2383.5 ops/sec
---------------------------------------------------------------------------------------
tuned     | 2252.6 ops/s    | 2390.7 ops/sec   | 2345.1 ops/sec | 2472.2 ops/sec
            (+9.15%)          (+2.76%)           (+5.87%)         (+3.72%)

The final column is the performance gain to be had with my first stab at tuning this in conjunction with the new engine: +3.72%.

TL;DR:

The perf gain this would have had without #16942 going in: +13.63% [Going from (pre-syncWAL,undedicated,untuned) to (pre-syncWAL,dedicated,tuned)]

The perf gain to be had now: +4.84% [Going from (syncWAL,undedicated,untuned) to (syncWAL,dedicated,tuned)]

petermattis commented 7 years ago

@irfansharif This write-up is awesome reading. Lots of good work here even though this change isn't landing in 1.1.

petermattis commented 6 years ago

Closing this as we're unlikely to revisit this approach. Storage of the Raft log remains a performance concern, but my current thinking is to either use something like RocksDB 2PC or to add a new mechanism to RocksDB for storage of write-once read-almost-never values.

ajwerner commented 2 years ago

It seems like we're actively doing this, right?

sumeerbhola commented 2 years ago

A design + interface was merged in https://github.com/cockroachdb/cockroach/pull/72795

sumeerbhola commented 2 years ago

The design mentioned in the previous comment has a WAL in both storage engines. The WAL for the state machine engine is very rarely synced. This is similar to the setup described in the second experiment from https://github.com/cockroachdb/cockroach/issues/38322#issuecomment-504856253 which demonstrated a doubling in throughput, with the following hypothesis "which is presumably in part because we no longer have Raft entry application getting mixed in the same WAL as the Raft log writes" -- i.e., since each fdatasync is syncing half the data volume, it is faster.

Due to the recent regression noticed when loosely coupled raft log truncation was enabled (see https://github.com/cockroachdb/cockroach/issues/78412#issuecomment-1119922463 and earlier comments by @erikgrinaker), we probably need a more thorough understanding of the write amplification effect of the various options.

The experiments here are done using a benchmark whose workload consists of a set of ranges; for each range it does a single MVCCBlindPut, adds that batch to the raft log (with sync), then modifies the batch to also update the RangeAppliedState and commits it to the state machine (without sync). It cycles through all the ranges using a single thread -- this was done to ensure that the LSM(s) did not fall behind in compactions. Whenever a range has > 64KB in the raft log, a truncation is done, either in a tightly-coupled manner, i.e., the truncation is done immediately (without sync), or in a loosely-coupled manner where the next flushed callback from the state machine engine causes the truncation. Code is in https://github.com/sumeerbhola/cockroach/tree/sep_lsm.
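
A rough sketch of that benchmark loop, with hypothetical helper functions standing in for the real engine calls in the sep_lsm branch:

package seplsmbench

// rangeState and the function parameters below are hypothetical stand-ins for
// the engine calls in the sep_lsm branch.
type rangeState struct {
	raftLogBytes int64  // bytes appended since the last truncation
	lastIndex    uint64 // last raft log index written
}

const truncThreshold = 64 << 10 // truncate once a range holds > 64KB of log

// runWorkload cycles through the ranges with a single writer: append to the
// raft log (synced), apply to the state machine (unsynced), and once a range
// exceeds 64KB of log, hand its truncation point to enqueueTruncation, which
// either truncates immediately (tightly coupled) or waits for the next
// state-machine flush (loosely coupled).
func runWorkload(
	ranges []*rangeState,
	appendToRaftLog func(r *rangeState, valueBytes int) error, // with sync
	applyToStateMachine func(r *rangeState, valueBytes int) error, // without sync
	enqueueTruncation func(r *rangeState, upto uint64),
	ops int,
) error {
	const valueBytes = 1000 // 60-byte keys, 1000-byte random values per the text
	for i := 0; i < ops; i++ {
		r := ranges[i%len(ranges)]
		if err := appendToRaftLog(r, valueBytes); err != nil {
			return err
		}
		if err := applyToStateMachine(r, valueBytes); err != nil {
			return err
		}
		r.lastIndex++
		r.raftLogBytes += valueBytes
		if r.raftLogBytes > truncThreshold {
			enqueueTruncation(r, r.lastIndex)
			r.raftLogBytes = 0
		}
	}
	return nil
}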

Since there is no concurrency in the workload, there is no batching between calls to sync the WAL, so the throughput shouldn't be considered representative, but it serves as a way to compare alternatives. Also, most of the results here are about the write amplification (we discuss throughput at the end). The keys were 60 bytes and the values were 1000 bytes, and are completely random, so will not compress, which means the bytes in the WAL and in the sstables are quite comparable.

The benchmark is broadly used to compare 1eng-tight, 1eng-loose, 2eng-tight, 2eng-loose, with a few other knobs. Note that 2eng-tight and 1eng-loose can be considered odd:

These experiments were run on a gceworker with 100GB PD-ssd. The GCP documentation says these get 0.48MB/s/GB, so 48MB/s of throughput. Whether the provisioned bandwidth was saturated or not has no impact on the write amp numbers, but does have a bearing on the later throughput numbers. Measurements indicated we were using < 30MB/s, so not bandwidth saturated. All experiments set the maximum compaction concurrency to 8, and for 2eng this was split into 4 each for the 2 engines. Less than 1 CPU of the gceworker was being utilized by the benchmark, so there is no CPU saturation.

1. Experiments with 2 ranges

This is a favorable case for tightly-coupled truncation since each range will truncate after 64KB, which allows most truncations to happen while the original raft log entry is still in the memtable. We consider this case since even though a store has 1000s of ranges, it is possible that the write load is dominated by a handful of ranges, and the system should not regress in performance for such a case.

Summary: with the "large data volume" 2eng-loose was comparable to 1eng-tight, with fair memtable sizing across the 2 setups. This suggests that in the 1eng case there is more than expected write amplification caused by the modest amount of raft bytes that do get flushed to L0. This may need more investigation.

Experiment 1.1: 64MB memtable in all engines. Small data volume (1.1GB written to the WAL)

For the 2eng case, each WAL saw ~550MB of writes. For 2eng-tight, only 4.8MB was written to L0 for the raft-eng. So log truncation is ensuring that using an LSM to store the raft logs is very efficient -- all the write cost is in the WAL of the raft-eng. The following shows the summary numbers for each setup. The compaction number includes flush bytes. Write-amp is calculated as compaction bytes / WAL bytes. The summary numbers are produced after a manual compaction compacts everything to L6.

1eng-loose: Write Bytes WAL: 1.1 GiB, Compaction: 5.9 GiB, W-Amp(C/W): 5.45
1eng-tight: Write Bytes WAL: 1.1 GiB, Compaction: 4.4 GiB, W-Amp(C/W): 4.06
2eng-loose: Write Bytes WAL: 1.1 GiB, Compaction: 4.3 GiB, W-Amp(C/W): 3.70
2eng-tight: Write Bytes WAL: 1.1 GiB, Compaction: 3.3 GiB, W-Amp(C/W): 2.82

We expect 1eng-tight to have lower write-amp than 1eng-loose and 2eng-tight to have lower write-amp than 2eng-loose, and the numbers are consistent with that expectation. This is because the loose ones need to wait for the raft log entries to flush before the corresponding state machine changes are durable, and then the truncation is applied.

The surprise was 2eng-loose having lower write-amp than 1eng-tight. Also, we would expect 1eng-tight to be the same as 2eng-tight given that only a tiny amount of the raft log made its way out of the memtable in the tight case. The suspicion was that the better behavior of 2eng is due to the unfair advantage that 2eng has in terms of having 2 64MB memtables, so they can hold more in-memory before flushing, which means fewer flushes and less compaction out of L0 etc.

Experiment 1.2: 64MB memtable in all engines. Larger data volume (21GB written to the WAL)

The larger data volume ensures there is ~10GB in the state machine at the end of the experiment, so levels L3-L6 are populated. The summary numbers:

1eng-loose: Write Bytes WAL: 21 GiB, Compaction: 211 GiB, W-Amp(C/W): 10.01
1eng-tight: Write Bytes WAL: 21 GiB, Compaction: 190 GiB, W-Amp(C/W): 9.00
2eng-loose: Write Bytes WAL: 21 GiB, Compaction: 157 GiB, W-Amp(C/W): 7.41
2eng-tight: Write Bytes WAL: 21 GiB, Compaction: 141 GiB, W-Amp(C/W): 6.68

Again we see both 2eng setups being better than 1eng-tight.

Experiment 1.3: 64MB memtables for 2eng, 128MB memtable for 1eng. Medium data volume (2.2GB written to the WAL)

1eng-loose: Write Bytes WAL: 2.2 GiB, Compaction: 11 GiB, W-Amp(C/W): 5.14
1eng-tight: Write Bytes WAL: 2.2 GiB, Compaction: 7.9 GiB, W-Amp(C/W): 3.63
2eng-loose: Write Bytes WAL: 2.2 GiB, Compaction: 9.6 GiB, W-Amp(C/W): 4.40
2eng-tight: Write Bytes WAL: 2.2 GiB, Compaction: 7.9 GiB, W-Amp(C/W): 3.60

Once we remove the unfair advantage of the memtable size, we see that 1eng-tight and 2eng-tight are comparable. Interestingly, 1eng-loose is worse than 2eng-loose, even though the number of truncations and the indexes that were truncated matched up. We noticed more bytes were written to L0 for 1eng-loose than 2eng-loose. We hypothesize that this difference is because there is a slight difference in when the sm-eng (state machine engine) memtable flushes in the 2eng case, which allows some raft log entries to be truncated before the raft-eng memtable flushes.

The remaining experiments all use this memtable size configuration.

Experiment 1.4: Large data volume (21GB written to the WAL)

This was done to see if the comparisons from the previous experiment held up with a larger data volume. NB: the following results are repeatable (the largest delta of write-amp in a different run was 0.03).

1eng-loose: Write Bytes WAL: 21 GiB, Compaction: 187 GiB, W-Amp(C/W): 8.84
1eng-tight: Write Bytes WAL: 21 GiB, Compaction: 157 GiB, W-Amp(C/W): 7.41
2eng-loose: Write Bytes WAL: 21 GiB, Compaction: 156 GiB, W-Amp(C/W): 7.41
2eng-tight: Write Bytes WAL: 21 GiB, Compaction: 142 GiB, W-Amp(C/W): 6.70

Interestingly, despite no unfairness in memtable sizing, 2eng-tight is better than 1eng-tight and 2eng-loose is comparable to 1eng-tight. Looking at 2eng-tight, we see the raft-eng has only 96MB being written to L0 (the rest is truncated before the flush):

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    44 M       -    11 G       -       -       -       -    11 G       -       -       -     1.0
      0         0     0 B    0.00    11 G     0 B       0     0 B       0    96 M     198     0 B       0     0.0

So ~1% of what was written to the memtable is not truncated at flush time, which is because doing truncation is not perfectly aligned with filling up of a memtable. Looking at the details of the LSM shape, we hypothesize that a similar 96MB being flushed to L0 in the 1eng case could have an outsize influence in later compactions from L0 to L3 and so on down to L6, since the 2eng case has another advantage: the file count target for L0 and the size targets for lower levels are shared for the 1eng case.

2. Experiments with 5000 ranges

This is designed to nullify the advantage of tightly-coupled truncation since 64KB truncation threshold * 5000 = ~320MB, which is much larger than the ~64MB of memtable capacity being used by raft entries. We expect loosely-coupled truncation to do well here. The main reason to run this is to understand the write amplification cost of storing the raft logs in an LSM, when most of the raft log bytes do get flushed to L0.

Experiment 2.1: Large data volume (21GB written to the WAL)

1eng-loose: Write Bytes WAL: 21 GiB, Compaction: 217 GiB, W-Amp(C/W): 10.27
1eng-tight: Write Bytes WAL: 21 GiB, Compaction: 205 GiB, W-Amp(C/W): 9.67
2eng-loose: Write Bytes WAL: 21 GiB, Compaction: 197 GiB, W-Amp(C/W): 9.32
2eng-tight: Write Bytes WAL: 21 GiB, Compaction: 190 GiB, W-Amp(C/W): 8.98

Another experiment that used SingleDel to truncate the log in all these cases showed no improvement.

Ideas for improvement (if this is a realistic scenario):

One advantage of separating the engines is that the write cost of the raft log becomes very obvious and it becomes more realistic to work on optimizing it.

3. Throughput experiments

The observed throughput in the earlier experiments for the four alternatives was the same. To investigate, we looked at pebble.InternalIntervalMetrics, which provide information about the WAL writer throughput and the WAL sync latency distribution. This confirmed that the observed throughput is constrained by the WAL throughput of the engine containing the raft log. Observations:

@tbg @nvanbenschoten @erikgrinaker @cockroachdb/storage

erikgrinaker commented 2 years ago

Thanks for running these experiments @sumeerbhola! I think this just confirms that using an LSM for the Raft log is suboptimal: a write-amp of up to 10 seems pretty bad, and this work is mostly wasted since we're ideally only reading this data once.

Can we do any further optimizations to the Pebble engine used for the log to get this number down? E.g. reduce the number of levels, reduce the compaction aggressiveness, increase the memtable size, and so on? It would be good to get this down into the 2-4 range in the typical case under high write volume.

Longer-term, I think we should consider using a different storage engine -- ideally, we should be able to get this down to two write operations: commit and truncation. Any thoughts on this?

sumeerbhola commented 2 years ago

Playing around with the L0CompactionThreshold helps. The first LSM is using the default of 2 and the second one is using 10. With 10, the read amp stays <= 6, so I think this is a viable setting for the raft log.

Note that if the log is not going to be truncated soon and since we want quick lookup by index, one can't avoid flushing. So the 11GB in the WAL + 10GB of write to L0 probably can't be avoided (without a different indexing structure that directly indexes the WAL). On top of that we are paying 23GB. Increasing the L0CompactionThreshold further had diminishing returns, but I did not investigate in a lot of detail. Regarding using a different storage engine, I'm not very inclined towards it -- these things initially look simple but over time become a huge effort. If we start running with separated engines, and the raft engine with a higher L0CompactionThreshold, and then see that real production cases resemble this 5000 active ranges setup, I think we can come up with techniques to optimize Pebble for this use case (e.g. value separation like WiscKey, discussed in https://github.com/cockroachdb/pebble/issues/112, since the keys are tiny in this case).

L0CompactionThreshold=2
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   5.5 M       -    11 G       -       -       -       -    11 G       -       -       -     1.0
      0         0     0 B    0.00    11 G     0 B       0     0 B       0    10 G   5.3 K     0 B       0     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      4         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      5        21    64 M    0.99    10 G     0 B       0    78 M      40    19 G   5.7 K    21 G       1     1.9
      6        32   184 M       -   8.6 G     0 B       0    34 K      12    26 G   4.3 K    35 G       1     3.1
  total        53   247 M       -    11 G     0 B       0    78 M      52    67 G    15 K    56 G       2     6.2
  flush       199
compact      4527     0 B     0 B       0          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)

L0CompactionThreshold=10
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   5.5 M       -    11 G       -       -       -       -    11 G       -       -       -     1.0
      0        58   106 M    0.60    11 G     0 B       0     0 B       0    10 G   6.0 K     0 B       3     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      4         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      5        12    46 M    0.93    10 G     0 B       0     0 B       0   8.0 G   2.1 K    12 G       1     0.8
      6        17   107 M       -   5.9 G     0 B       0   194 M      49    15 G   2.7 K    21 G       1     2.6
  total        87   259 M       -    11 G     0 B       0   194 M      49    44 G    11 K    33 G       5     4.1
  flush       199
compact      1740   303 M     0 B       0          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
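
For reference, a minimal sketch of applying that setting when opening a dedicated Pebble instance for the raft log (the directory name and everything else here is illustrative):

package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open a dedicated Pebble instance for the raft log with a higher
	// L0CompactionThreshold (10 instead of the default 2), trading some L0
	// read amplification for fewer compactions out of L0.
	db, err := pebble.Open("raft-log-engine", &pebble.Options{
		L0CompactionThreshold: 10,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
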
erikgrinaker commented 2 years ago

I think we can come up with techniques to optimize Pebble for this use case

Yeah, that's our only option in the medium-term anyway, so we may as well see how far we can take it.

tbg commented 2 years ago

Thanks for all of the explanations. Between your write-up and some looking at the experiment code, I feel like I know what's going on, and it makes sense.

The IO overload experiments I've been looking into have dwelled on w-amp a lot, too. Most participants (maybe not you, but definitely myself and @nicktrav) did not expect such large multiples (we're seeing 15x most of the time on larger clusters). I agree that to do anything here, we need to separate the raft log out first, so I hope we can get back to that soon.

Playing around with the L0CompactionThreshold helps. The first LSM is using the default of 2 and the second one is using 10. With 10, the read amp stays <= 6, so I think this is a viable setting for the raft log.

Reading this:

// The amount of L0 read-amplification necessary to trigger an L0 compaction.

and since, in these experiments, I can't imagine there would be files that can share a sublevel (every SST overlaps every other SST), isn't this just the number of SSTs required until we compact out of L0? So 2 will trigger (or at least suggest) a compaction when the second SST enters L0. Shouldn't you then necessarily see a read-amp of around 10 if that setting is 10? Does it have to do with compactions creating sublevels that drop us below 10?

I also don't have a mental model of how the LSM shape evolves under experiment 2.1. We get mutually overlapping files in L0 that have "a little bit of everything" for each log. When we compact two of them, we are likely both growing and shrinking the result, since log truncations and log writes are mixed. Will the result go to Lbase+ either way, or can things stay around in L0? If something makes it "out of L0", isn't it chased by a tombstone soon thereafter -- but does the tombstone even have a chance of catching up?

Other than the improvements you mentioned, will multi-level compactions help here? Or maybe even multi-level compactions that scan the source file(s) selectively, only picking up deletion tombstones. That way, they can "thin out" the lower levels without actually transporting more data into a lower level. That somehow ties in with my earlier question about letting tombstones catch up with the values they delete. If tombstones were stored separately from the log entries, could we do this more efficiently? Somehow the problem is that we basically know that most entries are or will soon be deleted, but because everything gets jumbled up in L0 it becomes very hard to do anything clever.

Is a PebblesDB-type guard (only for L0) something to consider here, where we split the SSTs by replica? This will create lots of small SSTs, but maybe that's fine since we have no operations that scan across SSTs. If we then also split SSTs into values and deletions upon flush, we should be able to pick compactions that take a deletion input file (we can use rangedels for raft log truncation) and all "older" SSTs and compact them together, which likely means removing all SSTs (ok, now we're assuming a single rangedel, but otherwise the result is just an empty sst with a rangedel, also good). For example:

---------keys--------------------------->
     /| log 1       | log 2 | ... | log 50k |
L0  / | [3.put.sst] | ...
    \ | [2.del.sst] | ...
     \| [1.put.sst] | ...
----------------------------------------
L1 (no partitions)
      [sst with log entries that stuck around for a long time ]

Assuming all 50k "partitions" look like log 1, a compaction could pick, from each partition (combining enough partitions to create a large enough output file), the newest deletion SST and all of the prior SSTs, knowing that a "compaction" basically means throwing all of them away in the common case. There sure are a lot of files, though, but most of them are just that: files. It's not like we need to touch them often, and if we do, it will be on a per-log basis with a strong bias towards the most recent files.

This all verges (probably deeply) into the "custom engine" territory (and has lots of problems I didn't think to consider), and I trust your intuition to put that off ~indefinitely. But it's immensely helpful for my intuitive and actual grasp of the topic to spitball these kinds of ideas.

petermattis commented 2 years ago

If we then also split SSTs into values and deletions upon flush, we should be able to pick compactions that take a deletion input file (we can use rangedels for raft log truncation) and all "older" SSTs and compact them together, which likely means removing all SSTs (ok, now we're assuming a singlerangedel, but otherwise the result is just an empty sst with a rangedel, also good).

FYI, delete-only compactions are an optimization which Pebble already implements. Splitting sstables on replica boundaries could blow up the number of sstables, but perhaps that is worthwhile if doing so could take advantage of delete-only compactions such that the old sstables are simply removed without ever being read for a compaction.

I prototyped a table partitioning API a long time ago. This would give similar functionality to PebblesDB-type guards. Very little work to do there, just didn't have an immediate use for it so it was never pushed over the finish line.

tbg commented 2 years ago

I used https://github.com/cockroachdb/cockroach/issues/82109 to get some w-amp numbers as well. See the slack thread for more:

Currently, to support 4mb/s of uniformly random write throughput (400 ops/sec of 10kb random data), we need to write to disks at ~150mb/s and read at ~60mb/s (these numbers go up slowly as time passes; they were read off around 11h in). The LSM w-amp on these nodes is ~14.2. That means that 150/(4*14.2) = 2.6 is the additional w-amp factor introduced by the fact that we write everything twice: once to the raft log, and then once to the state machine. Isn't that surprising? I thought we would eat at most an additional factor of 2x, and hopefully a little less (since some raft log entries might be deleted while still in the memtable).

I am curious whether 2.6 is at all expected here. I might be holding something wrong.
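
Spelled out as a worked example (the numbers are the ones reported in that run, not fresh measurements):

package main

import "fmt"

func main() {
	diskWriteMBps := 150.0 // observed disk write throughput
	workloadMBps := 4.0    // 400 ops/sec * 10kb of random data
	lsmWAmp := 14.2        // w-amp reported by pebble

	// Extra factor attributed to writing each key twice:
	// once to the raft log, once when applying to the state machine.
	extra := diskWriteMBps / (workloadMBps * lsmWAmp)
	fmt.Printf("additional write factor: %.1f\n", extra) // ~2.6
}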

tbg commented 2 years ago

I ran an alternative computation using the reported total disk bytes written vs the total SQL bytes received (source), and this came out to a "real end to end" w-amp of 15.8 vs the 14.2 reported by pebble, so a factor of 1.1 and change. I need to think more about what this actually means, but it strikes me as unexpectedly favorable. We hand everything to pebble twice (once on append and once on apply), so the factor being small seems to imply that the first write (appending the entry to the WAL) mostly doesn't need to get flushed to L0, i.e. it incurs a w-amp much smaller than the apply write, which will always traverse the entire LSM (since, unlike the raft log entries, it doesn't get deleted).

tbg commented 2 years ago

We have 450 replicas here, which probably explains it. The memtable is 64mb, so it takes ~6400 writes to fill it up. That's 6400/450 = 14.2 writes per replica, and 14.2*10kb is ~142kb, which is larger than the 64kb threshold at which we start log truncations.

So it's plausible that in this workload, most raft entries are deleted in place before ever getting flushed (though the deletions also add to the memtable, so we'd have to halve the ~6400 figure -- but then it still fits).
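
As a worked example (using the same decimal approximations as above):

package main

import "fmt"

func main() {
	const (
		memtableBytes  = 64e6 // 64mb memtable
		entryBytes     = 10e3 // 10kb per write
		replicas       = 450
		truncThreshold = 64e3 // per-replica log size at which truncation kicks in
	)

	writesToFill := float64(memtableBytes) / entryBytes // ~6400 writes to fill the memtable
	perReplica := writesToFill / replicas               // ~14.2 writes per replica
	bytesPerReplica := perReplica * entryBytes          // ~142kb of log per replica

	fmt.Printf("%.0f writes to fill the memtable, %.1f per replica (~%.0fkb vs a %.0fkb truncation threshold)\n",
		writesToFill, perReplica, bytesPerReplica/1e3, float64(truncThreshold)/1e3)
}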

nvanbenschoten commented 2 years ago

Nice experiments and writeup @sumeerbhola! The note about the "fsync side-effects" section in https://lwn.net/Articles/842385/ and then the experimental results to back it up are especially interesting.

In light of that article, one thing that I don't understand is why the syncs for the two WALs in the 2eng experiments had significantly different latency profiles. Also, if an fsync on one file has the effect of syncing all files, why was WALBytesPerSync needed for the state machine WAL but not needed for the raft log WAL? Wouldn't we expect the frequent syncing of the raft log WAL to obviate the need for WALBytesPerSync?

Another experiment that used SingleDel to truncate the log in all these cases showed no improvement.

Does this make sense, given that we don't see the raft engine's LSM expand beyond a single level? Or did we see the raft engine's LSM grow to multiple levels and still not see an improvement using SingleDel?

Either way, it's worth mentioning those results in https://github.com/cockroachdb/cockroach/issues/8979.
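
(For context, a sketch of the two truncation flavors being discussed -- per-entry SingleDelete vs a single range deletion -- written against the public pebble Go API; the key encoding below is hypothetical and omits the real range/replica prefix.)

package main

import (
	"encoding/binary"
	"log"

	"github.com/cockroachdb/pebble"
)

// raftLogKey is a toy key encoding: just the big-endian log index.
// CockroachDB's real raft log keys also carry a range/replica prefix.
func raftLogKey(index uint64) []byte {
	k := make([]byte, 8)
	binary.BigEndian.PutUint64(k, index)
	return k
}

func main() {
	db, err := pebble.Open("raft-log-demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Flavor 1: truncate entries [1, 100) with one SingleDelete per entry.
	// SingleDelete assumes the key was Set at most once since the last
	// deletion, letting the tombstone and the Set annihilate during
	// compaction; overwritten raft entries need extra care.
	b := db.NewBatch()
	for idx := uint64(1); idx < 100; idx++ {
		if err := b.SingleDelete(raftLogKey(idx), nil); err != nil {
			log.Fatal(err)
		}
	}
	if err := b.Commit(pebble.Sync); err != nil {
		log.Fatal(err)
	}

	// Flavor 2: the same truncation as a single range deletion.
	if err := db.DeleteRange(raftLogKey(1), raftLogKey(100), pebble.Sync); err != nil {
		log.Fatal(err)
	}
}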

sumeerbhola commented 2 years ago

In light of that article, one thing that I don't understand is why the syncs for the two WALs in the 2eng experiments had significantly different latency profiles.

I don't have an answer. I may be interpreting that article incorrectly -- maybe the "sync everything" behavior is limited to the journal. There are old threads like this one https://groups.google.com/g/fa.linux.kernel/c/ogRJPoaso9g/m/ysmSK69ylS8J, which also seem related but I can't quite make sense of them.

petermattis commented 2 years ago

Can't we test whether the "sync everything" behavior is actually taking place, and under what circumstances? Create two files of dirty data and call fsync and fdatasync on only one of them, then use Linux I/O profiling tools to see what is being written out at the block layer. Note that I'd want to see if fsync differs from fdatasync. Also worth testing sync_file_range (which is what we use for WALBytesPerSync).
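
A rough sketch of such a harness (Linux-only, assumes golang.org/x/sys/unix; watch the device with blktrace/iostat while it runs):

package main

import (
	"log"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// dirty creates a file and writes 1MB without syncing, leaving the
// data dirty in the page cache.
func dirty(path string) *os.File {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := f.Write(make([]byte, 1<<20)); err != nil {
		log.Fatal(err)
	}
	return f
}

func main() {
	a := dirty("file-a")
	b := dirty("file-b")
	defer a.Close()
	defer b.Close()

	// Sync only file-a: fsync(2) first, then fdatasync(2) for comparison.
	// file-b is left dirty; block-layer tracing shows whether its pages
	// get written out as a side effect.
	if err := a.Sync(); err != nil {
		log.Fatal(err)
	}
	if err := unix.Fdatasync(int(a.Fd())); err != nil {
		log.Fatal(err)
	}

	// A sync_file_range(2) variant (what WALBytesPerSync uses) could be
	// tested with unix.SyncFileRange on file-a as well.

	// Pause so the tracing window isn't polluted by the kernel's own
	// background writeback of file-b.
	time.Sleep(10 * time.Second)
}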

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/replication

sumeerbhola commented 1 year ago

@andrewbaptist suggested removing the fsync interference caused by sharing the filesystem by instead having 2 filesystems on the same block device. There are two options for this filesystem split:

andrewbaptist commented 1 year ago

A slightly different alternative is to not use fsync on either WAL and instead write them using AIO/O_DIRECT. If we go down that path, we would want to preallocate the WAL and then treat it as a circular buffer. Using the O_DIRECT path would mean being more explicit about our writes (they need to be 4K-aligned), but if we do that, then we also get some of the benefits of an async interface to the storage layer, since only the WAL write is in the critical path. This could also be done with io_uring, but I don't know a lot about that...
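
A rough sketch of what a 4K-aligned O_DIRECT write looks like from Go (Linux-only and purely illustrative; durability against the device cache would still need O_DSYNC or an explicit flush):

package main

import (
	"log"
	"os"
	"syscall"
	"unsafe"
)

const blockSize = 4096

// alignedBuf returns a blockSize-aligned slice of length blockSize,
// as required for O_DIRECT writes.
func alignedBuf() []byte {
	raw := make([]byte, 2*blockSize)
	off := blockSize - int(uintptr(unsafe.Pointer(&raw[0]))%blockSize)
	return raw[off : off+blockSize]
}

func main() {
	// O_DIRECT bypasses the page cache; offsets, lengths, and buffer
	// addresses must all be block-aligned.
	f, err := os.OpenFile("wal-segment", os.O_WRONLY|os.O_CREATE|syscall.O_DIRECT, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := alignedBuf()
	copy(buf, []byte("raft log entry payload..."))
	if _, err := f.WriteAt(buf, 0); err != nil {
		log.Fatal(err)
	}
}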

petermattis commented 1 year ago

The Pebble WAL is not pre-allocated (well it is, but not in the way that you're describing), but the WAL recycling functionality means it is more or less a circular buffer when operating in the steady state. We don't seem to have a good description of WAL recycling in a comment or design doc, but @jbowens or @sumeerbhola know where the bodies are buried (i.e. how this all works).