Open tbg opened 2 years ago
cc @cockroachdb/replication
+1 to the premise of this issue, thanks for writing this up.
An interesting observation that came out of our earlier discussion about this is that quorum replication can minimize the effect of fsync latency on writes (since raft commit and state machine application would see the K-th lowest fsync latency where K is the quorum size). However, a slow disk on the leaseholder will not be hidden from a reader attempting to read from that disk (after a block/page cache miss). For that, some form of hedging seems necessary. Strong consistency, even with consistent follower reads, complicates our ability to hedge reads. I don't understand the internals of cloud block storage well enough to know whether variable write latency is more of a problem than variable read latency.
It's also worth pointing out in this issue that (as of https://github.com/cockroachdb/cockroach/pull/38954) we don't wait for Raft entry application before acknowledging a write to the client. However, we do have to wait for the application before releasing that write's latches. As a result, any latency incurred by log application will not be paid by the writer itself, but will be paid by any reader or writer that contends with the writer.
> However, a slow disk on the leaseholder will not be hidden from a reader attempting to read from that disk (after a block/page cache miss)
Yes, that's a concern, too. I would suspect that write throughput suffers more frequently than read throughput (since the latter is easier to scale), but both scenarios exist.
I'm not aware of great hedges here. There is always the option of transferring the lease (when there is a suitable target) but that is a very binary option and it's hard to determine the right moment to do so.
> It's also worth pointing out in this issue that (as of https://github.com/cockroachdb/cockroach/pull/38954) we don't wait for Raft entry application before acknowledging a write to the client. However, we do have to wait for the application before releasing that write's latches. As a result, any latency incurred by log application will not be paid by the writer itself, but will be paid by any reader or writer that contends with the writer.
Just to spell this out for myself, in the ideal case, if we have a leader with a slow disk, and we've done all of the work in #17500 with an eye on this, a writer could come in, the write would commit with two followers, the append to the leader would be stuck, but the leader would learn that this entry committed and this would be enough to signal the waiting writer.
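To spell out the arithmetic, here is a toy sketch (made-up names, neither CockroachDB nor etcd/raft code) of how the commit index can be derived purely from the voters' durable append progress, so that the leader's stuck local append doesn't hold up commit:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index that is durably appended on a
// quorum of voters. matchIndex holds each voter's durable append progress,
// including the leader's own (possibly lagging) local log.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(matchIndex)/2 + 1
	// The quorum-th highest match index is present on at least a quorum.
	return sorted[quorum-1]
}

func main() {
	// Three voters: the leader's local append is stuck at 90 while both
	// followers have durably appended 100. Index 100 is still committed,
	// which (with #17500) is enough to apply it and signal the writer.
	fmt.Println(commitIndex([]uint64{90, 100, 100})) // 100
}
```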
What's more, the leader doesn't have to wait for entries to be durable locally before it applies them (let's ignore conf changes and the like, where we can be stricter), assuming they got committed using other voters. The only gotcha is that if the leader crashes, and we have separate engines (= no ordering constraint between log write and state machine apply), we might come up with an `AppliedIndex` that's ahead of what's actually in the logs. We need to prevent this from tripping us up (today we never have such a gap) but the solution is to simply throw away the log, i.e. truncate to the `AppliedIndex`. This wouldn't even cause snapshots, since we'll be a follower now, and the more up to date log is elsewhere and can catch us up just fine.
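As a toy illustration of that recovery step (hypothetical types and names, not actual CockroachDB code), the startup check would look roughly like this:

```go
package main

import "fmt"

// raftState is a toy stand-in for one replica's on-disk state: a raft log
// keyed by index plus the state machine's applied index.
type raftState struct {
	log          map[uint64][]byte
	firstIndex   uint64
	lastIndex    uint64
	appliedIndex uint64
}

// reconcileOnStartup handles the new post-crash case in which entries were
// applied before they became durable in the local log: rather than treating
// the gap as corruption, throw the log away (truncate through appliedIndex).
// The replica is a follower now and gets caught up from the leader's longer
// log via ordinary appends, so no snapshot is needed.
func (s *raftState) reconcileOnStartup() {
	if s.appliedIndex <= s.lastIndex {
		return // today's invariant: firstIndex ≤ appliedIndex ≤ lastIndex
	}
	for i := s.firstIndex; i <= s.lastIndex; i++ {
		delete(s.log, i)
	}
	s.firstIndex = s.appliedIndex + 1 // empty log positioned after the applied state
	s.lastIndex = s.appliedIndex
}

func main() {
	s := &raftState{
		log:          map[uint64][]byte{5: nil, 6: nil},
		firstIndex:   5,
		lastIndex:    6,
		appliedIndex: 9, // applied state ran ahead of the durable log
	}
	s.reconcileOnStartup()
	fmt.Println(s.firstIndex, s.lastIndex) // 10 9
}
```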
One practical issue might be that in order to apply an entry, you have to have the entry. If it's not in the log, it has to be in memory somewhere, and there are limits to that. The raft entry cache is already well suited to this sort of thing, though it isn't clear how to interface with `etcd/raft`. It needs to tell us about new committed indexes, but today it will only ever communicate indexes for which it also surfaces the entries (i.e. it will never tell us 100 is committed, even if it is, unless it also gives us 100 to apply). So we'd likely want to relax that, i.e. have `etcd/raft` always surface the highest known commit index, even if that index isn't even in the stable or unstable log. In effect, we've got to take another pair of batteries out of raft (the unstable log).
> So we'd likely want to relax that, i.e. have `etcd/raft` always surface the highest known commit index, even if that index isn't even in the stable or unstable log. In effect, we've got to take another pair of batteries out of raft (the unstable log).
Could we talk more about this? I understand the problem, but came to the conclusion that the unstable log in `etcd/raft` is exactly the thing that will help us here, as it means that `etcd/raft` will have a way to find the entries without needing to rely on CRDB being able to locate them somewhere in the append pipeline before the append completes. It also means that we don't need to open the can of worms that is pinning entries in the raft entry cache and then promising that some subset of the cache will not be evicted.
For context, the way I've adjusted the unstable log in `etcd/raft` to support https://github.com/cockroachdb/cockroach/issues/17500 is actually to not change its meaning (though I had to come full circle on this). Entries in the unstable log remain in the unstable log until the caller informs `etcd/raft` that the entries are stable and can be reliably retrieved from `Storage`. Here's the comment I added to the struct, if it's helpful (it was to me):
```go
// unstable contains "unstable" log entries and snapshot state that has
// not yet been written to Storage. The type serves two roles. First, it
// holds on to new log entries and an optional snapshot until they are
// handed to a Ready struct for persistence. Second, it continues to
// hold on to this state after it has been handed off to provide raftLog
// with a view of the in-progress log entries and snapshot until their
// writes have been stabilized and are guaranteed to be reflected in
// queries of Storage. After this point, the corresponding log entries
// and/or snapshot can be cleared from unstable.
```
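To make that lifecycle concrete, here is a toy version of the contract from the caller's perspective (simplified, made-up types; not the real etcd/raft API): entries handed out for persistence stay in the unstable log until the caller reports them stable.

```go
package main

import "fmt"

// entry and unstableLog are illustrative placeholders only.
type entry struct{ Index, Term uint64 }

type unstableLog struct{ entries []entry }

// stableTo mimics the caller informing raft that entries up to index are now
// reliably readable from Storage, so they may leave the unstable log.
func (u *unstableLog) stableTo(index uint64) {
	for len(u.entries) > 0 && u.entries[0].Index <= index {
		u.entries = u.entries[1:]
	}
}

func main() {
	u := &unstableLog{entries: []entry{{Index: 7, Term: 3}, {Index: 8, Term: 3}}}

	// 1. The entries are handed to a Ready for persistence, but they remain
	//    in unstable so raft can keep serving them (e.g. for MsgApps) while
	//    the disk write is in flight.
	pending := u.entries

	// 2. The async append completes (the write has been fsynced).
	durable := pending[len(pending)-1].Index

	// 3. Only now does the caller report stability; unstable can shrink.
	u.stableTo(durable)
	fmt.Println(len(u.entries)) // 0
}
```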
> I understand the problem
Apologies, bringing up the raft entry cache here was premature. My main concern is being able to control memory use across a possibly large number of raft instances in a given process. Anything that lives "inside" raft typically gets a configuration parameter that limits memory usage for that raft instance alone, but this is insufficient for our usage[^recvq].
Taking a step back, our memory control issues already start way above raft, when messages enter the raft receive queue. From there they get stepped into RawNode, where they now linger until they've been fully processed and can get GC'ed.
At the same time, RawNode can pull entries from the log into memory, via `raft.Storage`. And the allocs still get referenced from the outgoing msg queue.
Ideally we can curb and prioritize all of these in a coherent way that additionally allows for prioritization between ranges (system ranges > user ranges).
When a raft message arrives from "outside" (i.e. MsgApp), we could acquire from a mem budget[^drop]. The message then enters raft and likely the unstable log, though this is not guaranteed. Finding out when that entry goes out of scope seems really tricky and yet that's what ultimately we'd like to be able to do. Essentially it's out of scope when raft no longer holds a reference to it (i.e. it has left the unstable log, if it even entered it in the first place), and we don't hold a reference to it (it has been made durable, we've sent MsgApp to all followers, and it's been evicted from the raft entry cache). Of course we have the outgoing raft transport which has its own queue, further complicating the lifecycle.
Conceptually, it would be nice to live in one of two worlds, where either raft owns all memory or no memory. If it owns all memory, rather than having a "raft receive queue", we synchronously pass every incoming MsgApp to raft which will either accept it (accounting for it) or force us to drop it. Similarly, we ack every message sent to raft once it has actually been sent so that it can update its accounting. Or raft owns no memory (but we need to synchronize with it quite a bit, which doesn't seem great). So instead of an unstable log, it holds on to some abstraction that we own, and we keep doing raft receive queues, etc, and the abstractions are fallible and may force raft to defer work it was planning on doing. Of course there is a part of allocations that can never be "inside" of raft, namely the creation of new proposals.
This all seems fairly involved and deciding what to do about it isn't in scope for this issue.
But async append will hold on to memory for longer, so we need at least to make sure that there is some mechanism by which we degrade to sync appends after a certain inflight size. This could happen outside of raft, by waiting for the append to become durable (thus emptying out the unstable log) before continuing.
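A sketch of what such a backstop outside of raft could look like (hypothetical names and numbers, not an actual CockroachDB mechanism): keep appending asynchronously until the unsynced bytes exceed a budget, then block until the log writer catches up.

```go
package main

import (
	"fmt"
	"sync"
)

// appendThrottle degrades async log appends to effectively synchronous ones
// once too many bytes are in flight (handed to the log but not yet durable).
type appendThrottle struct {
	mu           sync.Mutex
	cond         *sync.Cond
	inflightB    int64
	maxInflightB int64
}

func newAppendThrottle(max int64) *appendThrottle {
	t := &appendThrottle{maxInflightB: max}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// beforeAppend blocks while the budget is exhausted, which is exactly
// "waiting for the append to become durable before continuing".
func (t *appendThrottle) beforeAppend(size int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for t.inflightB+size > t.maxInflightB {
		t.cond.Wait()
	}
	t.inflightB += size
}

// onDurable is invoked from the log writer's sync callback.
func (t *appendThrottle) onDurable(size int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inflightB -= size
	t.cond.Broadcast()
}

func main() {
	t := newAppendThrottle(1 << 20) // allow 1 MiB of unsynced log data
	t.beforeAppend(512 << 10)
	t.onDurable(512 << 10)
	fmt.Println("ok")
}
```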
The second comment which was sort of jumbled up with my other one in the post above is that raft "hides" the true commit index from peers; it will only tell them about the one they can actually pull up the entry for (from stable storage). I thought this would have to change but I no longer think so because the committed index is sent on the MsgApp. So we just have to remember it somewhere and it becomes effective only when the append has become durable. So nothing surprising happens here.
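The bookkeeping could be as small as the following sketch (made-up names): remember the commit index carried on MsgApp and let it take effect only once the corresponding local append is durable.

```go
package main

import "fmt"

// commitTracker keeps the leader-advertised commit index separate from the
// locally durable log position; illustrative only.
type commitTracker struct {
	pendingCommit uint64 // highest commit index seen on any MsgApp
	durableIndex  uint64 // highest locally durable log index
	commitIndex   uint64 // commit index currently in effect
}

func (c *commitTracker) onMsgAppCommit(commit uint64) {
	if commit > c.pendingCommit {
		c.pendingCommit = commit
	}
	c.maybeAdvance()
}

func (c *commitTracker) onAppendDurable(index uint64) {
	if index > c.durableIndex {
		c.durableIndex = index
	}
	c.maybeAdvance()
}

// maybeAdvance lets the remembered commit index become effective only up to
// what the local log has durably appended.
func (c *commitTracker) maybeAdvance() {
	eff := c.pendingCommit
	if c.durableIndex < eff {
		eff = c.durableIndex
	}
	if eff > c.commitIndex {
		c.commitIndex = eff
	}
}

func main() {
	c := &commitTracker{}
	c.onMsgAppCommit(100)      // leader says 100 is committed
	fmt.Println(c.commitIndex) // 0: local append not yet durable
	c.onAppendDurable(100)
	fmt.Println(c.commitIndex) // 100
}
```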
[^recvq]: Not something we do today, but we're sort of close.
[^drop]: If we can't, we need to react, and dropping the message isn't particularly helpful, but let's talk about that another time.
There is a prototype in https://github.com/cockroachdb/cockroach/pull/98852 which allows a small number of writes to go through with a stalled disk at the leader. They currently stall because of `logSyncQSem`, and @jbowens anticipates that the next two roadblocks would be:

> once the logSyncQSem semaphore is out of the way, I expect the next two limits are: a) the log writer's 512kb buffers and b) memtable rotation, which can't proceed on a stalled disk.
So it looks like we have a good idea of what needs to be done here.
I'll note that #98852 injects the "disk" stall artificially via a custom `vfs.FS`, so before we declare that we know how to fix this issue we should verify with a real disk stall. Concretely, any read that hits the disk (i.e. doesn't get served by pebble block cache or OS page cache) might prove fatal. In particular, if we find that a leaseholder cannot evaluate a single command on account of such a read, then it's effectively stalled since it won't produce any new log entries to apply.
Another gotcha that we would need to address (from this thread): if we're applying entries that are not stable locally, these entries might actually become stable before the log entries, and so after restart we can end up in a new situation where log and applied index are not reachable from each other via the log (today, first index ≤ applied index ≤ last index).
> some form of hedging seems necessary
Seems generally useful:
> Another gotcha that we would need to address (from this thread): if we're applying entries that are not stable locally, these entries might actually become stable before the log entries, and so after restart we can end up in a new situation where log and applied index are not reachable from each other via the log (today, first index ≤ applied index ≤ last index).
In a discussion with @sumeerbhola, we discovered that this is not the problem that it first appears to be in today's code. That's because as of https://github.com/cockroachdb/cockroach/pull/94165 and until we do https://github.com/cockroachdb/cockroach/issues/94853, raft log writes are synchronously sequenced with state machine writes in the same pebble engine and on the same raft scheduler goroutine. The durability of these raft log writes is async from the perspective of the raft state machine, just like the durability of state machine writes, but pebble ensures that the raft log writes will make it to stable storage before the state machine writes. In other words, even with async log writes and async state machine writes, state machine application will never outpace log writes for a given log entry to stable storage. So first index ≤ applied index ≤ last index will be preserved.
More so, the synchronous call to `CommitNoSyncWait()` with each batch of log entries provides immediate read-your-writes visibility, even before the corresponding `SyncWait()` has returned, so Cockroach should not observe issues with self-referential entry application like log truncation which reaches back around and manipulates the raft log. As long as these writes also go through pebble, they don't actually need to assume durability of the log entries that they manipulate. One concern would be self-referential entry application which reaches around pebble, but @pav-kv just took care of the one case I'm aware of with https://github.com/cockroachdb/cockroach/pull/114191.
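Spelled out as a sketch (the method names mirror the ones referenced above; the surrounding structure is simplified and hypothetical), the ordering that yields read-your-writes before durability is roughly:

```go
// Package raftsketch illustrates the sequencing described above; it is not
// the actual CockroachDB code.
package raftsketch

// Batch is a stand-in for the engine batch interface; the two methods mirror
// CommitNoSyncWait/SyncWait in name only.
type Batch interface {
	CommitNoSyncWait() error // sequence the batch; immediately visible to reads
	SyncWait() error         // block until the batch is durable on disk
}

// handleRaftReady sketches why application cannot outpace log durability in
// today's code: the log batch is sequenced before the apply batch, and the
// WAL is written in commit order, so the apply batch cannot reach stable
// storage before the log batch does.
func handleRaftReady(logBatch, applyBatch Batch) error {
	// 1. Sequence the raft log write. It is now visible to reads from this
	//    engine (read-your-writes), though not yet durable.
	if err := logBatch.CommitNoSyncWait(); err != nil {
		return err
	}

	// 2. Apply committed entries. The apply batch itself never needs to wait
	//    for its own sync.
	if err := applyBatch.CommitNoSyncWait(); err != nil {
		return err
	}

	// 3. Only acknowledgements that require durability (e.g. append/vote
	//    responses to the leader) wait here.
	return logBatch.SyncWait()
}
```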
> let's ignore conf changes and the like
This has been a cause for concern with this issue, but it's not clear to me now if anything in the apply-time config change protocol would break. The visibility property above is pretty general. It means that if a leader were to apply a conf change before the corresponding log entry was durable locally, it wouldn't notice unless it crashed. And if it crashed, the leadership must fail over to one of the followers in the quorum who had voted for the conf change. The new leader would then need to apply the conf change before appending any new conf changes to its log. From the perspective of the followers who comprise the quorum which committed the conf change, the old leader not having the entry durably in its log before it crashed doesn't make a difference.
One edge case we haven't discussed yet in this issue is split leader/leaseholder scenarios. For kv write availability, it's important that the leaseholder applies and acks entries, not the leader. So this issue would be better titled "allow committing entries not in leaseholder's stable storage". Regardless, this situation is interesting because raft leaders do not always (ever?) communicate commit indexes to followers in excess of the follower's raft log tail. This is important because
> `// The leader MUST NOT forward the follower's commit to an unmatched index.`
I'm not sure whether that's an implementation limitation (tests panic without this check) or whether it's more fundamental. For instance, it may be unsafe to tell a follower about a commit index if the leader isn't certain that the follower doesn't have a divergent log which has yet to be reconciled.
We would need to change this in order for a follower/leaseholder replica to apply and acknowledge writes that are not durable in its log. This case might have other availability issues as well. However, it may be rare enough to ignore for the initial version of the work.
The TL;DR of this is that things mostly "may just work", at least at the raft level. Many of the gotchas fade away because we perform the non-durable portion of raft log writes synchronously and pebble provides read-your-writes on these non-durable log writes up to the point where the leader crashes.
If this all is safe, then the question becomes: how much of a disk stall can we buffer over in pebble? To avoid impact on the workload, we at least need to be able to buffer over enough writes to give us time to transfer away or shed leases. Even if we can't buffer over a 30s stall, if we can buffer over a 6s stall and then add logic to not renew leases during a local disk stall, we should be able to survive.
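The lease-side piece might be as small as the following sketch (entirely hypothetical names and thresholds): skip lease extension whenever the local disk hasn't completed a sync recently.

```go
package main

import (
	"fmt"
	"time"
)

// shouldExtendLease is a hypothetical guard on lease renewal: if the local
// disk has not completed a sync within stallThreshold, let the lease lapse so
// another replica can acquire it, instead of pinning an unserviceable
// leaseholder.
func shouldExtendLease(lastSuccessfulSync, now time.Time, stallThreshold time.Duration) bool {
	return now.Sub(lastSuccessfulSync) < stallThreshold
}

func main() {
	now := time.Now()
	healthy := now.Add(-500 * time.Millisecond)
	stalled := now.Add(-8 * time.Second)
	fmt.Println(shouldExtendLease(healthy, now, 6*time.Second)) // true: renew
	fmt.Println(shouldExtendLease(stalled, now, 6*time.Second)) // false: shed the lease
}
```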
**Is your feature request related to a problem? Please describe.**
The entire point of quorum replication is to provide high availability for writes. As long as a quorum of voters can write to disk & communicate, forward progress ought to be possible.
Unfortunately, in practice `etcd/raft` forces the leader to append to its local log before disseminating new log entries. A leader with write-degraded storage thus renders the range inoperable to varying degrees.[^1] In other words, the blast radius is larger than necessary.
Generally, since users are sensitive even to short blips in latency, avoiding the high tail latencies of synchronous writes is a good idea. More symmetry helps with that.
**Describe the solution you'd like**
Improve the interface of `RawNode` such that it decouples the operations that move data to stable storage (voting, log appends, persisting the occasional commit index) from the other operations (in particular, sending messages). This requires or at least strongly suggests also doing #17500, since a leader that cannot apply entries that were committed without being locally durable will not release latches and thus continues to propagate the effects of its poor disk health to foreground traffic.

**Describe alternatives you've considered**
We could more aggressively fail over away from leaders that are not performing as expected. In a sense, this would be the more "sturdy" alternative, and possibly a better ROI if the main scenario we're trying to address is persistent degradation.
Decoupling the various Ready tasks should noticeably smooth out kinks for which a fail-over would be too heavy-handed a solution. It also allows us to markedly improve the performance in the steady state.
So we should really do both anyway.
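To make the proposed decoupling a bit more tangible, here is a simplified sketch (toy types, not the real RawNode/Ready API): outgoing messages are sent immediately, while the durable append work is merely queued for a separate log-writer goroutine.

```go
package main

import "fmt"

// Toy stand-ins for the relevant pieces of a Ready; not the etcd/raft types.
type message struct{ to uint64 }
type entry struct{ index uint64 }
type ready struct {
	messages []message // outgoing msgs, e.g. MsgApps built from unstable entries
	entries  []entry   // new entries that still need a durable local append
}

func main() {
	appendQ := make(chan []entry, 16) // drained by a dedicated log-writer goroutine
	rd := ready{
		messages: []message{{to: 2}, {to: 3}},
		entries:  []entry{{index: 100}},
	}

	// Decoupled handling: send messages right away so followers can append
	// and commit regardless of how slow the local disk is...
	for _, m := range rd.messages {
		fmt.Println("send MsgApp to follower", m.to)
	}
	// ...while the local append is only enqueued; its completion is reported
	// back to raft asynchronously (not shown).
	appendQ <- rd.entries
	fmt.Println("queued", len(rd.entries), "entries for async append")
}
```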
**Additional context**
- https://github.com/cockroachdb/cockroach/issues/88442
- https://github.com/cockroachdb/cockroach/issues/88596
- https://github.com/cockroachdb/cockroach/issues/17500
[^1]: until the disk degrades to the point where the leader fails to heartbeat followers at a sufficient frequency, at which point a new leader can be elected
Jira issue: CRDB-19840
Epic CRDB-40197