nvanbenschoten opened this issue 5 years ago
As mentioned earlier in Slack, this sounds really promising. Do you have any intuition on what benchmark needles this would move, and by how much?
I really like the idea of using the raft log as the WAL, and decoupling it as much as possible from writing of data pages. I'm seeing a number of systems do something similar (ex. https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). The idea is that the WAL gets stored to a "log service", which makes it durable and then applies that in distributed fashion to various page caches and stores. In Socrates, there are 3 levels of data pages. When a data page is needed, the local cache (usually on local SSD) is checked. If the page is not there, then the much larger (but slower) page servers are consulted. If those don't have it, then the page is fetched from super-cheap, ultra-high-capacity (but higher latency) blob storage.
Assuming we eventually scale up into 100's of TBs or petabytes of data, we may want to consider similar ideas. They also allow for more efficient use of EBS, since currently when we use EBS we effectively store 9 copies of the data (3 CRDB replicas * 3 EBS replicas of the data). Having a clean, well-defined separation between log storage and data storage will be essential to opening up these and other future possibilities.
@nvanbenschoten Raft log entries and applied commands are not the only data we write to RocksDB. For example, gossip stores bootstrap info in the local RocksDB instance. I think we'd want these writes to have the same guarantees as they currently do. You've probably already thought about this, but I think it should be explicit here: the only RocksDB writes which would avoid the WAL are those for Raft command application.
We'll need to check the behavior of RocksDB with regards to mixing writes which use the WAL and writes which skip the WAL in the same instance. In Pebble, this will work as you'd expect, because the Pebble commit pipeline writes each batch to the WAL individually, while RocksDB tries to combine batches. I recall there was some special code in the RocksDB commit path to handle this scenario, but can't recall the details. And our grouping of batches in storage/engine might negate any RocksDB special handling. Cc @ajkr
Do you have any intuition on what benchmark needles this would move, and by how much?
It's tough to say for sure, but we can use rafttoy to get an idea. I ran rafttoy with its parallel-append pipeline, which most closely resembles Cockroach as it is today (plus https://github.com/cockroachdb/cockroach/issues/37426), and posted some numbers below. In the first set of numbers, I'm comparing a configuration with a single Pebble instance for the Raft log and data storage against a configuration with two Pebble instances, one for the Raft log and one for data storage. Nothing is changed with the WALs. This latter approach is essentially what was proposed and prototyped in https://github.com/cockroachdb/cockroach/issues/7807. We see that at peak throughput levels, splitting the instances roughly doubles throughput, presumably in part because Raft entry application no longer lands in the same WAL as the Raft log writes.
name old time/op new time/op delta
Raft/conc=32/bytes=256-16 60.0µs ± 9% 69.7µs ± 8% +16.23% (p=0.000 n=15+15)
Raft/conc=64/bytes=256-16 20.9µs ± 5% 28.5µs ± 1% +36.23% (p=0.000 n=15+11)
Raft/conc=128/bytes=256-16 12.2µs ± 2% 13.6µs ± 5% +11.09% (p=0.000 n=12+15)
Raft/conc=256/bytes=256-16 10.9µs ± 1% 8.8µs ± 8% -18.52% (p=0.000 n=11+15)
Raft/conc=512/bytes=256-16 11.1µs ± 2% 7.0µs ± 8% -37.57% (p=0.000 n=12+15)
Raft/conc=1024/bytes=256-16 10.0µs ±16% 5.8µs ±10% -41.91% (p=0.000 n=15+15)
Raft/conc=2048/bytes=256-16 10.3µs ±21% 5.2µs ± 5% -49.96% (p=0.000 n=14+14)
Raft/conc=4096/bytes=256-16 10.9µs ± 7% 5.2µs ± 4% -52.40% (p=0.000 n=13+14)
Raft/conc=8192/bytes=256-16 9.65µs ±23% 6.27µs ± 1% -34.98% (p=0.000 n=15+12)
Raft/conc=16384/bytes=256-16 11.2µs ± 8% 10.1µs ± 3% -9.59% (p=0.000 n=15+15)
name old speed new speed delta
Raft/conc=32/bytes=256-16 4.27MB/s ± 8% 3.68MB/s ± 8% -13.92% (p=0.000 n=15+15)
Raft/conc=64/bytes=256-16 12.3MB/s ± 5% 9.0MB/s ± 1% -26.55% (p=0.000 n=15+12)
Raft/conc=128/bytes=256-16 20.9MB/s ± 2% 18.9MB/s ± 5% -9.90% (p=0.000 n=12+15)
Raft/conc=256/bytes=256-16 23.6MB/s ± 1% 29.0MB/s ± 9% +22.91% (p=0.000 n=11+15)
Raft/conc=512/bytes=256-16 23.0MB/s ± 2% 36.9MB/s ± 8% +60.34% (p=0.000 n=12+15)
Raft/conc=1024/bytes=256-16 25.8MB/s ±18% 44.0MB/s ± 9% +70.72% (p=0.000 n=15+15)
Raft/conc=2048/bytes=256-16 24.0MB/s ± 9% 49.4MB/s ± 5% +105.45% (p=0.000 n=12+13)
Raft/conc=4096/bytes=256-16 23.5MB/s ± 7% 49.3MB/s ± 4% +109.87% (p=0.000 n=13+14)
Raft/conc=8192/bytes=256-16 27.1MB/s ±25% 40.8MB/s ± 1% +50.36% (p=0.000 n=15+12)
Raft/conc=16384/bytes=256-16 22.9MB/s ± 8% 25.2MB/s ± 3% +10.45% (p=0.000 n=15+15)
In the second set of numbers, I'm comparing a configuration with a single Pebble instance for the Raft log and data storage against a configuration with two Pebble instances, one for the Raft log and one for data storage. In this scenario, the Pebble instance used only for data storage has its WAL disabled, so Raft entry application does not write to a WAL at all. We again see that at peak throughput levels, splitting the instances roughly doubles throughput.
name old time/op new time/op delta
Raft/conc=32/bytes=256-16 60.0µs ± 9% 67.1µs ± 5% +11.80% (p=0.000 n=15+15)
Raft/conc=64/bytes=256-16 20.9µs ± 5% 27.9µs ± 4% +33.36% (p=0.000 n=15+15)
Raft/conc=128/bytes=256-16 12.2µs ± 2% 12.8µs ± 7% +4.66% (p=0.003 n=12+15)
Raft/conc=256/bytes=256-16 10.9µs ± 1% 8.1µs ± 2% -25.73% (p=0.000 n=11+11)
Raft/conc=512/bytes=256-16 11.1µs ± 2% 6.0µs ±14% -45.96% (p=0.000 n=12+15)
Raft/conc=1024/bytes=256-16 10.0µs ±16% 5.0µs ± 5% -50.18% (p=0.000 n=15+15)
Raft/conc=2048/bytes=256-16 10.3µs ±21% 5.7µs ±25% -44.45% (p=0.000 n=14+15)
Raft/conc=4096/bytes=256-16 10.9µs ± 7% 5.5µs ±26% -49.97% (p=0.000 n=13+15)
Raft/conc=8192/bytes=256-16 9.65µs ±23% 6.03µs ± 4% -37.47% (p=0.000 n=15+15)
Raft/conc=16384/bytes=256-16 11.2µs ± 8% 9.4µs ± 2% -16.18% (p=0.000 n=15+15)
name old speed new speed delta
Raft/conc=32/bytes=256-16 4.27MB/s ± 8% 3.82MB/s ± 5% -10.62% (p=0.000 n=15+15)
Raft/conc=64/bytes=256-16 12.3MB/s ± 5% 9.2MB/s ± 4% -25.02% (p=0.000 n=15+15)
Raft/conc=128/bytes=256-16 20.9MB/s ± 2% 20.0MB/s ± 7% -4.34% (p=0.004 n=12+15)
Raft/conc=256/bytes=256-16 23.6MB/s ± 1% 31.7MB/s ± 2% +34.65% (p=0.000 n=11+11)
Raft/conc=512/bytes=256-16 23.0MB/s ± 2% 42.7MB/s ±13% +85.75% (p=0.000 n=12+15)
Raft/conc=1024/bytes=256-16 25.8MB/s ±18% 51.2MB/s ± 5% +98.87% (p=0.000 n=15+15)
Raft/conc=2048/bytes=256-16 24.0MB/s ± 9% 46.0MB/s ±29% +91.39% (p=0.000 n=12+15)
Raft/conc=4096/bytes=256-16 23.5MB/s ± 7% 48.1MB/s ±22% +104.49% (p=0.000 n=13+15)
Raft/conc=8192/bytes=256-16 27.1MB/s ±25% 42.5MB/s ± 4% +56.43% (p=0.000 n=15+15)
Raft/conc=16384/bytes=256-16 22.9MB/s ± 8% 27.2MB/s ± 2% +19.11% (p=0.000 n=15+15)
Aggregated together, we see that the maximum throughputs of each configuration are:
merged instance = 27.1MB/s
split instance + data WAL = 49.4MB/s
split instance + no data WAL = 51.2MB/s
Note: I'm not sure what to make of the lower concurrencies being better with a merged Pebble instance. I wonder if we're so far away from saturation at these concurrencies that the extra WAL writes are somehow smoothing syncs out. Perhaps @petermattis has some intuition here.
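For concreteness, here's a minimal sketch (not rafttoy's actual wiring) of what the "split instances, data WAL disabled" configuration looks like in Pebble terms; the directory names and keys are placeholders:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Log engine: keeps its WAL, and Raft appends are synced before acking.
	logEng, err := pebble.Open("raft-log", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer logEng.Close()

	// Data engine: WAL disabled, so applied state is only as durable as its
	// periodic memtable flushes.
	dataEng, err := pebble.Open("data", &pebble.Options{DisableWAL: true})
	if err != nil {
		log.Fatal(err)
	}
	defer dataEng.Close()

	// Append a Raft entry durably, then apply it without any WAL write.
	if err := logEng.Set([]byte("raft/r1/entry/42"), []byte("command"), pebble.Sync); err != nil {
		log.Fatal(err)
	}
	if err := dataEng.Set([]byte("user/key"), []byte("value"), pebble.NoSync); err != nil {
		log.Fatal(err)
	}
}
```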
I really like the idea of using the raft log as the WAL, and decoupling it as much as possible from writing of data pages. I'm seeing a number of systems do something similar...
I agree. Even if we don't move to distinct log and data "services" like Microsoft does with Socrates in that paper, keeping the abstractions separate with a clear boundary between them opens up a lot of flexibility for us to place different constraints on those components. @andreimatei has been thinking in this area recently as well.
Raft log entries and applied commands are not the only data we write to RocksDB. For example, gossip stores bootstrap info in the local RocksDB instance. I think we'd want these writes to have the same guarantees as they currently do. You've probably already thought about this, but I think it should be explicit here: the only RocksDB writes which would avoid the WAL are those for Raft command application.
I hadn't thought about this yet, but you're right. We'll still need a WAL for the data engine and we will need to be careful about which writes to the data engine can skip the WAL (i.e. Raft entry application) and which will still need to go in the WAL. This may require us to re-evaluate the level of durability we demand of various writes throughout our system, which I expect to be a healthy exercise. I've always been a little nervous that the rate at which we sync our single RocksDB WAL might be good at hiding areas where we're not clear enough about durability requirements necessary for correctness. The subtlety of this in areas like https://github.com/etcd-io/etcd/issues/7625#issuecomment-489232411 shines a light on how tricky this is, though perhaps we're already very conservative (i.e. always sync) and the fear is unfounded.
I put together a prototype that sets WriteOptions::disableWAL to true during Raft entry application. This would potentially allow us to address this issue without the need for a separate storage engine for the Raft log. I verified that the change appeared to work by monitoring the size of the WAL for a given workload before and after the change and noticing about a 50% reduction in traffic.
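For illustration, the prototype's write-path split might look roughly like the following if expressed against the third-party gorocksdb bindings (CockroachDB's actual engine goes through its own C++ wrapper, so treat this as a sketch of the idea rather than the real change; the function names are made up):

```go
package sketch

import "github.com/tecbot/gorocksdb"

// applyRaftEntries commits a batch of applied entry state with the WAL
// disabled; the Raft log (synced elsewhere) is what makes it recoverable.
func applyRaftEntries(db *gorocksdb.DB, wb *gorocksdb.WriteBatch) error {
	wo := gorocksdb.NewDefaultWriteOptions()
	defer wo.Destroy()
	wo.DisableWAL(true)
	return db.Write(wo, wb)
}

// appendRaftEntries commits Raft log appends through the WAL and syncs them,
// since they must be durable before the write is acknowledged.
func appendRaftEntries(db *gorocksdb.DB, wb *gorocksdb.WriteBatch) error {
	wo := gorocksdb.NewDefaultWriteOptions()
	defer wo.Destroy()
	wo.SetSync(true)
	return db.Write(wo, wb)
}
```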
The prototype gave a noticeable win, but it was less pronounced than I was hoping for:
name old ops/sec new ops/sec delta
kv0/size=1/cores=16/nodes=3 29.9k ± 0% 31.1k ± 0% +3.89% (p=0.008 n=5+5)
kv0/size=256/cores=16/nodes=3 28.0k ± 1% 28.7k ± 0% +2.45% (p=0.008 n=5+5)
kv0/size=8192/cores=16/nodes=3 3.30k ± 2% 3.49k ± 1% +5.65% (p=0.008 n=5+5)
name old p50(ms) new p50(ms) delta
kv0/size=1/cores=16/nodes=3 1.50 ± 0% 1.40 ± 0% -6.67% (p=0.008 n=5+5)
kv0/size=256/cores=16/nodes=3 1.60 ± 0% 1.50 ± 0% -6.25% (p=0.008 n=5+5)
kv0/size=8192/cores=16/nodes=3 10.0 ± 0% 8.9 ± 0% -11.00% (p=0.029 n=4+4)
name old p99(ms) new p99(ms) delta
kv0/size=1/cores=16/nodes=3 4.70 ± 0% 4.70 ± 0% ~ (all equal)
kv0/size=256/cores=16/nodes=3 5.20 ± 0% 5.20 ± 0% ~ (all equal)
kv0/size=8192/cores=16/nodes=3 71.3 ± 6% 65.0 ± 0% -8.84% (p=0.016 n=5+4)
This required me to split the engine.rocksDBBatch pipeline into a with-WAL group and a without-WAL group for the purposes of batching, so it's possible I messed something up there or that the loss in batching is counteracting the reduction in write amplification. I'll run the numbers again with the split groups but no use of WriteOptions::disableWAL to isolate the change.
I also plan to rebase the prototype on top of https://github.com/cockroachdb/cockroach/pull/38568 and re-run the numbers once that's merged.
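For reference, the batching split might be structured roughly as below. The types are hypothetical stand-ins for the rocksDBBatch commit pipeline, not the real code: pending batches are grouped by whether they skip the WAL, and each group is flushed to the engine as one write with matching options.

```go
package sketch

// pendingBatch is a hypothetical stand-in for a queued rocksDBBatch commit.
type pendingBatch struct {
	repr       []byte     // serialized batch contents
	disableWAL bool       // true only for Raft entry application
	done       chan error // signaled once the group commit completes
}

// engineWriter stands in for the underlying engine's group-commit call.
type engineWriter interface {
	WriteBatches(reprs [][]byte, disableWAL bool) error
}

// commitPending flushes the with-WAL and without-WAL groups separately, so
// batches with different WAL options are never merged into one engine write.
func commitPending(eng engineWriter, pending []*pendingBatch) {
	var withWAL, withoutWAL []*pendingBatch
	for _, b := range pending {
		if b.disableWAL {
			withoutWAL = append(withoutWAL, b)
		} else {
			withWAL = append(withWAL, b)
		}
	}
	for _, group := range [][]*pendingBatch{withWAL, withoutWAL} {
		if len(group) == 0 {
			continue
		}
		reprs := make([][]byte, len(group))
		for i, b := range group {
			reprs[i] = b.repr
		}
		err := eng.WriteBatches(reprs, group[0].disableWAL)
		for _, b := range group {
			b.done <- err
		}
	}
}
```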
The prototype gave a noticeable win, but it was less pronounced than I was hoping for
Hmm, those numbers are not nearly as compelling as what you were seeing before. I recall you mentioning in person that using an actual separate RocksDB instance for the Raft log appeared to give a significant win for rafttoy.
This required me to split the engine.rocksDBBatch pipeline into a with-WAL group and a without-WAL group for the purposes of batching, so it's possible I messed something up there or that the loss in batching is counteracting the reduction in write amplification. I'll run the numbers again with the split groups but no use of WriteOptions::disableWAL to isolate the change.
I seem to recall that the RocksDB commit pipeline internally has some funkiness when batches have different options. It is possible that the source of the problem is there. And it is possible that using a single RocksDB instance is the problem itself. The WAL-skipping batches still need to wait for the previous WAL-writing batches to sync. Now that I think about it, that seems likely to be the problem.
I seem to recall that the RocksDB commit pipeline internally has some funkiness when batches have different options. It is possible that the source of the problem is there. And it is possible that using a single RocksDB instance is the problem itself. The WAL-skipping batches still need to wait for the previous WAL-writing batches to sync. Now that I think about it, that seems likely to be the problem.
I talked to @ajkr about this and he said that it's likely that RocksDB is batching the two groups together under the hood. He's going to look into whether there's an easy way to avoid this. Perhaps we can disable this batching entirely since we're already doing it ourselves?
If not, this change may require the raft log / data store split. Even with that split, we'll need to be careful about this effect, given your point in https://github.com/cockroachdb/cockroach/issues/38322#issuecomment-504462011 about some writes still requiring durability in the data store engine.
I talked to @ajkr about this and he said that it's likely that RocksDB is batching the two groups together under the hood. He's going to look into whether there's an easy way to avoid this. Perhaps we can disable this batching entirely since we're already doing it ourselves?
Yeah, they might be grouped into the same uber-batch. But even without that grouping there is a problem. Consider what happens if there is a commit of batch A that writes to the WAL and a concurrent commit of batch B that skips the WAL. If the commit of B happens after the commit of A it will get queued up behind A and require waiting for A to be written to the WAL and then applied to the memtable. The sync of the WAL itself will happen afterwards, though. Hmm, I'm not sure if this is a problem or not.
The above prototype focused on qps performance numbers, but could the more important dimension here be IOPS? If the WAL is essentially off for the data engine, it would only be flushing its memtable periodically. Roughly speaking, we currently write every bit three times: once into the raft log WAL (and hopefully never out to an SST, since it's usually tombstoned before it gets there; we then need to flush the tombstone, but I'll ignore that here since we've thought about SingleDelete to avoid it), once into the data WAL, and once into a (data) SST. Shouldn't average IOPS go down by around 33% with the prototype?
My mental model is a little unclear here about how extra writes to the WAL map to extra IOPS. Certainly if we were syncing twice as much, we would expect double the number of writes, but we're not. When @ajkr took a look at why we weren't seeing the expected improvement here, he found that the extra WAL writes were bloating the amount we wrote to the WAL as expected, but that we were writing so little between syncs that the extra writes weren't causing us to spill into extra SSD pages, so the number of writes from the SSD's perspective remained constant.
There's definitely some experimentation to be done here.
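As a back-of-envelope illustration of that finding (assuming a 4KB device write unit, which is an assumption, not a measurement): cutting WAL bytes only cuts device writes once the savings cross a page boundary between syncs.

```go
package main

import "fmt"

// pagesPerSync rounds the bytes written between syncs up to whole 4KB pages,
// which is roughly what the device sees as write operations.
func pagesPerSync(bytesPerSync int) int {
	const pageSize = 4096
	return (bytesPerSync + pageSize - 1) / pageSize
}

func main() {
	// Small group commits: halving the WAL bytes doesn't change the page count.
	fmt.Println(pagesPerSync(3000), pagesPerSync(1500)) // 1 1
	// Large group commits: halving the WAL bytes roughly halves the page count.
	fmt.Println(pagesPerSync(64<<10), pagesPerSync(32<<10)) // 16 8
}
```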
I wonder whether we should go a step further and in cloud deployments move the raft logs of a node to a separate EBS volume (EBS is virtualized storage, and is already charged in a fine-grained manner, so there shouldn't be a cost increase).
This would help with "repaving" by transferring the EBS volume corresponding to the state machine, which is where most of the data volume is. Such a transfer would not reduce resiliency since the old node would still have a raft log and can participate in the quorum (it just can't apply since it has no state machine).
I wonder whether we should go a step further and in cloud deployments move the raft logs of a node to a separate EBS volume (EBS is virtualized storage, and is already charged in a fine-grained manner, so there shouldn't be a cost increase).
This would help with "repaving" by transferring the EBS volume corresponding to the state machine, which is where most of the data volume is. Such a transfer would not reduce resiliency since the old node would still have a raft log and can participate in the quorum (it just can't apply since it has no state machine).
Interesting idea. I think the biggest blocker to achieving this in cockroach is the fact that we share the storage engine containing the state machine for many ranges in a single store (LSM and EBS volume). If we had a separate LSM per range, this sort of thing becomes more feasible. It does raise other questions involving safe log truncation and snapshots.
While I'm here, I'd like to raise a point that has come up offline a number of times. Using a separate volume for applied state might make new hardware configurations reasonable. In particular, it might be reasonable to use a single, very fast storage device for the raft logs and then apply entries onto HDDs. In a cloud setting this might look like a small but heavily IOPS-provisioned EBS volume. Server configurations with lots of large disks and a single, smallish SSD are not uncommon for cool-cold storage applications; such a server is currently rather inadvisable with cockroach. Perhaps with a separate storage engine, one could achieve reasonably good performance for write-mostly workloads at a dramatically lower cost per byte.
I think the biggest blocker to achieving this in cockroach is the fact that we share the storage engine containing the state machine for many ranges in a single store (LSM and EBS volume). If we had a separate LSM per range, this sort of thing becomes more feasible.
Not clear to me why we need an LSM per range. I was specifically thinking of the repaving/decommissioning use case, where all the ranges are being moved to a new node (the latest number for a 10TB node decommission was 12 hours, I think). Am I missing something?
If we also consider rebalancing, the following is a copy-paste of an idea I had mentioned to the storage folks a few months ago
I was thinking again about the storage density issue as it relates to decommissioning and rebalancing a node with 10TB of data. That is, how do we avoid the cost of copying all that data around? It seems unlikely that cloud blob storage, which allows multiple readers/writers, will have a good enough SLA any time in the future.
EBS is virtualized storage, and is already charged in a fine-grained manner, so there isn’t a cost benefit of a 10TB EBS versus a 50GB EBS. An EBS that is too small will cause too much internal fragmentation and inter-EBS load-balancing. So say we had 10 1TB EBSs/stores per node. Load based rebalancing and decommissioning would just move a 1TB store to a different node, which would usually just cause a 10% increase in the accepting node (if 10% is too much over-provisioning we could further reduce the store size, but there is a tradeoff). If byte sizes of ranges are relatively stable, there wouldn’t be too much inter-store rebalancing. And we could be lax and allow stores to have cpu consumption more than 10% for a while as long as we can find a node to host it. If that load spike turns out not to be transient we would eventually resort to inter-store rebalancing.
One insight we can steal from this issue that may also be helpful for placing the state machine in disaggregated storage is that the state machine itself stores the Raft applied index of each Range in the RangeAppliedStateKey. Since all updates to the state machine are atomic, we can always take an arbitrarily stale snapshot of the state machine and determine which log entries need to be applied to it to catch it up, assuming our log hasn't been truncated past this point.
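A minimal sketch of that catch-up idea, using hypothetical interfaces (the real applied index lives in the RangeAppliedStateKey and real entry application is far more involved):

```go
package sketch

import "errors"

type entry []byte

// stateMachine and raftLog are hypothetical stand-ins for the applied state
// and the durable Raft log.
type stateMachine interface {
	AppliedIndex() uint64 // read from the (possibly stale) snapshot's applied state
	Apply(e entry) error  // atomically bumps the applied index as well
}

type raftLog interface {
	FirstIndex() uint64
	LastIndex() uint64
	Entry(index uint64) (entry, error)
}

var errLogTruncated = errors.New("log truncated above the snapshot's applied index; need a full snapshot")

// catchUp replays the log suffix that a stale state-machine snapshot is
// missing, which is well defined as long as the log hasn't been truncated
// past the snapshot's applied index.
func catchUp(sm stateMachine, rl raftLog) error {
	applied := sm.AppliedIndex()
	if applied+1 < rl.FirstIndex() {
		return errLogTruncated
	}
	for idx := applied + 1; idx <= rl.LastIndex(); idx++ {
		e, err := rl.Entry(idx)
		if err != nil {
			return err
		}
		if err := sm.Apply(e); err != nil {
			return err
		}
	}
	return nil
}
```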
In a way, this is saying the same thing as the rest of this issue. Only, this issue is describing how a replica's local LSM's memtable could be ephemeral. The extension here is expanding this to a replica whose entire local LSM is ephemeral, assuming we have some other durable store backing a consistent prefix of it.
Interestingly, such an architecture would motivate replicated log truncation again, because log truncation would be a function of the single durable state machine.
Of course, this leaves out all the hard details.
Not clear to me why we need an LSM per range.
An LSM per range makes these kinds of things easier because we can associate a set of files with exactly one Range. So it becomes more straightforward to dump a Range's state machine to something like S3. Or to pull a stale snapshot of a Range in from S3.
As much as I'm a fan of these offhand discussions about sticking the applied state of a range in some external storage system, it feels quite premature and speculative. Maybe there are places we want to take cockroach where pursuing such a strategy makes sense, but there will likely have to be a clear and tactical motivation for such a change. Either way, better isolating the raft state and applied state seems beneficial along that path, as well as supporting a good many other goals.
Here's the way I'd break it down:
Soon:
Then, speculatively:
I think merging the raft log and the WAL into a log engine may add extra I/O overhead; it may lead to I/O problems unless it is a log engine shared by all raft groups. Here is a benchmark, tested on a 7200 RPM HDD:
not sharing the log engine (one log file per raft group), wkB/s:
7292.00
3620.00
8844.00
440.00
9464.00
7452.00
6916.00
7444.00
6292.00
104.00
4716.00
5760.00
sharing the log engine, wkB/s:
1024.00
25112.00
35140.00
35144.00
35136.00
964.00
35140.00
28.00
35144.00
30136.00
35188.00
code for not sharing the log engine:
package main

import (
	"fmt"
	"os"
)

const (
	Loop  = 100
	Batch = 10000
)

var data []byte
var fs []*os.File

func init() {
	data = make([]byte, 512)
	// Ten files, one per simulated raft group.
	fs = make([]*os.File, 10)
	for i := 0; i < 10; i++ {
		fs[i], _ = os.Create(fmt.Sprintf("%v.dat", i))
	}
}

func main() {
	for i := 0; i < Loop; i++ {
		write()
	}
}

func write() {
	// Spread the appends round-robin across all ten files, then sync each one.
	for i := 0; i < Batch; i++ {
		fs[i%10].Write(data)
	}
	for _, f := range fs {
		f.Sync()
	}
}
code for sharing the log engine:
package main

import (
	"fmt"
	"os"
)

const (
	Loop  = 100
	Batch = 10000
)

var data []byte
var fs []*os.File

func init() {
	data = make([]byte, 512)
	// Two files; only fs[0] will be synced.
	fs = make([]*os.File, 2)
	for i := 0; i < 2; i++ {
		fs[i], _ = os.Create(fmt.Sprintf("%v.dat", i))
	}
}

func main() {
	for i := 0; i < Loop; i++ {
		write()
	}
}

func write() {
	// Every append goes to both files, but only the shared log file is synced.
	for i := 0; i < Batch; i++ {
		fs[0].Write(data)
		fs[1].Write(data)
	}
	fs[0].Sync()
}
unless it is a log engine shared by all raft groups.
That is the plan. One engine for the raft logs, one for the applied state.
unless it is a log engine shared by all raft groups.
That is the plan. One engine for the raft logs, one for the applied state.
What's the design proposal for sharing the raft logs? Adopting a key-value engine such as pebble to store the separate raft log would be easy, but it would introduce redundant IO since pebble has its own WAL too. Designing a dedicated log store for sharing raft logs is non-trivial if no redundant IO is permitted.
To start, we plan to store the logs directly in a Pebble engine, using Set operations when appending and SingleDelete operations when truncating. The thought is that Pebble's WAL will be the primary durable store of the Raft logs, and that its LSM will be kept quite shallow and relegated to raft groups with delayed log truncation. So in that sense, we don't expect (but will need to experiment with!) the redundant IO being too detrimental.
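To make that concrete, here's a minimal sketch of the append and truncate paths against a dedicated Pebble log engine. The key encoding is illustrative only (not CockroachDB's actual raft log key format), and the helpers are hypothetical:

```go
package sketch

import (
	"encoding/binary"

	"github.com/cockroachdb/pebble"
)

// raftLogKey is an illustrative (rangeID, index) encoding, not CockroachDB's.
func raftLogKey(rangeID, index uint64) []byte {
	k := make([]byte, 16)
	binary.BigEndian.PutUint64(k[:8], rangeID)
	binary.BigEndian.PutUint64(k[8:], index)
	return k
}

// appendEntries writes new log entries with Set and syncs Pebble's WAL, which
// is what makes the entries durable before the write is acknowledged.
func appendEntries(logEng *pebble.DB, rangeID, firstIndex uint64, entries [][]byte) error {
	b := logEng.NewBatch()
	for i, e := range entries {
		if err := b.Set(raftLogKey(rangeID, firstIndex+uint64(i)), e, nil); err != nil {
			return err
		}
	}
	return b.Commit(pebble.Sync)
}

// truncate removes entries below truncatedIndex with SingleDelete. With
// frequent enough truncation, the Set/SingleDelete pairs should mostly cancel
// before reaching L0, keeping the log LSM shallow.
func truncate(logEng *pebble.DB, rangeID, oldFirstIndex, truncatedIndex uint64) error {
	b := logEng.NewBatch()
	for idx := oldFirstIndex; idx < truncatedIndex; idx++ {
		if err := b.SingleDelete(raftLogKey(rangeID, idx), nil); err != nil {
			return err
		}
	}
	return b.Commit(pebble.NoSync)
}
```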
We have also floated ideas of collapsing Set and SingleDelete operations for the same key directly in the LSM memtable, either by using a mutable memtable or through a pre-L0 memtable compaction mechanism. If Set and SingleDelete can be collapsed before they reach L0 of the LSM, then with frequent enough log truncation, the write amplification would actually be optimal.
unless it is a log engine shared by all raft groups.
That is the plan. One engine for the raft logs, one for the applied state.
Oh, I asked because I saw that yugabyte's design keeps them separate, which I suspect will lead to IO problems, and then saw that crdb also had the idea of merging the WAL and the log. If it is done on top of pebble's WAL, I think it will not cause IO problems in theory.
We have also floated ideas of collapsing Set and SingleDelete operations for the same key directly in the LSM memtable, either by using a mutable memtable or through a pre-L0 memtable compaction mechanism. If Set and SingleDelete can be collapsed before they reach L0 of the LSM, then with frequent enough log truncation, the write amplification would actually be optimal.
The latter approach seems convincing; when could we expect to see it?
Probably not anytime soon. Like Andrew mentioned above, we're likely better served first exploring #17500, which also isn't scheduled to happen in the near term (~6-ish months or so).
cc @cockroachdb/replication
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
CockroachDB currently stores Raft log entries and data in the same RocksDB storage engine. Both log entries and their applied state are written to the RocksDB WAL, even though only log entries are required to be durable before acknowledging writes. We use this fact to avoid a call to fdatasync when applying Raft log entries, opting to only do so when appending them to a Replica's Raft log. https://github.com/cockroachdb/cockroach/issues/17500 suggests that we could even go further and delay Raft log application significantly, moving it to an asynchronous process.
Problems
This approach has two main performance problems:
Raft Log Entries + Data in a shared WAL
Raft log appends go into the same WAL as applied entry state, which is written but not fsync-ed. This means that even though applied entry state in the WAL isn't immediately made durable on its own, it is synced almost immediately by anyone appending log entries. This slows down appending to the Raft log, because doing so often needs to flush more of the WAL than it intended to.
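To illustrate the coupling, here's a small sketch using Pebble as a stand-in for RocksDB (the keys are made up): syncing the WAL persists every byte written to it so far, so a small synced Raft append pays for whatever unsynced entry application preceded it.

```go
package sketch

import "github.com/cockroachdb/pebble"

// appendAfterApply shows how a synced Raft append inherits the cost of an
// earlier, unsynced entry-application write in the same WAL.
func appendAfterApply(db *pebble.DB, appliedVal, raftEntry []byte) error {
	apply := db.NewBatch()
	if err := apply.Set([]byte("user/key"), appliedVal, nil); err != nil {
		return err
	}
	// Written to the shared WAL, but not fsync-ed yet.
	if err := apply.Commit(pebble.NoSync); err != nil {
		return err
	}

	app := db.NewBatch()
	if err := app.Set([]byte("raft/r1/entry/43"), raftEntry, nil); err != nil {
		return err
	}
	// This sync cannot complete without also flushing the apply batch's bytes,
	// inflating the latency of the Raft append.
	return app.Commit(pebble.Sync)
}
```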
https://github.com/cockroachdb/cockroach/issues/7807 observed that we could move the Raft log to a separate storage engine and that doing so would avoid this issue. The Raft log would use its own write-ahead log and would get tighter control over the amount of data that is flushed when it calls fdatasync. That change also hints at the use of a specialized storage engine for Raft log entries. The unmerged implementation ended up using RocksDB, but it could have opted for something else.
Data written to a WAL
Beyond the issue of applied state being written to the same WAL as the Raft log, it is also a problem that applied state is written to a WAL at all. A write-ahead log is meant to provide control over durability and atomicity of writes, but the Raft log already serves this purpose by durably and atomically recording the writes that will soon be applied. Paying twice for durability has a cost: write amplification. A piece of data is often written to the Raft log WAL, the Raft log LSM (if not deleted in the memtable, see https://github.com/cockroachdb/cockroach/issues/8979#issuecomment-414015919), the data engine WAL, and the data engine LSM. This extra write amplification reduces the overall write throughput that the system can support. Ideally, a piece of data would only be written to the Raft log WAL and the data engine LSM.
Proposed Solution
One fairly elegant solution to address both of these concerns is to move the Raft log to a separate storage engine and to disable the data engine's WAL. Doing so on its own almost works, but it runs into issues with crash recovery and with Raft log truncation.
Both of these concerns are due to the same root cause - naively, this approach makes it hard to know when applied entry state has become durable. This is hard to determine because applied entry state skips any WAL and is added only to RocksDB's memtable. To determine when this data becomes durable, we would need to know when data is compacted from the memtable to L0 in the LSM. Furthermore, we often would like to know which Raft log entries this now-durable data corresponds to, which is difficult to answer. For instance, to determine how much of the Raft log of a Replica we can safely truncate, we'd like to ask the question: "what is the highest applied index for this Replica that is durably applied?".
The solution here is to use the RangeAppliedStateKey, which is updated when each log entry is applied (or at least in the same batch of entries, see https://github.com/cockroachdb/cockroach/issues/37426#issuecomment-502123133). This key contains the RaftAppliedIndex, which in a sense allows us to map a RocksDB sequence number to a Raft log index. We then ask the question: "what is the durable state of this key?". To answer this question, we can query this key using the ReadTier::kPersistedTier option in ReadOptions.read_tier. This instructs the lookup to skip the memtable and only read from the LSM, which is the exact set of semantics that we want.

Using this technique, we can then ask the question about the highest applied Raft log index for a particular Replica. We can then use this, in combination with recent improvements in https://github.com/cockroachdb/cockroach/pull/34660, to perform Raft log truncation only up to durable applied log indexes. To do this, the raftLogQueue on a Replica would simply query the RangeAppliedState.raft_applied_index using ReadTier::kPersistedTier and bound the index it can truncate to by this value. Similarly, crash recovery could use the same strategy to know where it needs to reapply from. In the second case, explicitly using ReadTier::kPersistedTier might not be necessary because the memtable updates from before the crash will already be gone and the memtable will be empty.

For all of this to work, we need to ensure that we maintain the property that the application of a Raft log entry is always in the same RocksDB batch as the corresponding update to the RangeAppliedStateKey, and that RocksDB will always guarantee that these updates atomically move from the memtable to the LSM. These properties should be true currently.

The solution fixes both of the performance problems listed above and should increase write throughput in CockroachDB. This solution also opens CockroachDB up to future flexibility with the Raft log storage engine. Even if it was initially moved to another RocksDB instance, it could be specialized and iterated on in isolation going forward.
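For illustration, a sketch of the truncation bound described above. The persisted-only read is modeled as a hypothetical interface since the exact API depends on the engine bindings (RocksDB exposes it as ReadTier::kPersistedTier), and the key/decoding helpers are placeholders:

```go
package sketch

import (
	"encoding/binary"
	"fmt"
)

// persistedReader models a lookup that skips the memtable and returns only
// state that has reached the LSM (i.e. is durable even with the WAL disabled).
type persistedReader interface {
	GetPersisted(key []byte) ([]byte, error)
}

// Placeholder encodings; the real ones live in the keys and enginepb packages.
func rangeAppliedStateKey(rangeID uint64) []byte {
	return []byte(fmt.Sprintf("/Local/RangeID/%d/AppliedState", rangeID))
}

func decodeRaftAppliedIndex(val []byte) uint64 {
	return binary.BigEndian.Uint64(val)
}

// truncationBound caps a proposed truncation index at the highest durably
// applied index, so we never discard log entries that crash recovery might
// still need to replay.
func truncationBound(data persistedReader, rangeID, proposedIndex uint64) (uint64, error) {
	val, err := data.GetPersisted(rangeAppliedStateKey(rangeID))
	if err != nil {
		return 0, err
	}
	if durable := decodeRaftAppliedIndex(val); proposedIndex > durable {
		return durable, nil
	}
	return proposedIndex, nil
}
```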
Jira issue: CRDB-5635
Epic CRDB-40197