kvserver: provide an online mechanism to recover unavailable ranges

cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.

https://www.cockroachlabs.com

Other

30.13k stars 3.81k forks source link

kvserver: provide an online mechanism to recover unavailable ranges #41411

Closed ajwerner closed 1 year ago

ajwerner commented 5 years ago

Is your feature request related to a problem? Please describe.

When a cockroach cluster loses a quorum of replicas for a range, drastic action is require to return the cluster to a healthy state. The general approach today relies on shutting down the entire cluster and re-writing the range descriptors. This requires downtime even when the unavailable ranges are potentially housing non-critical data.

In general these situations ought to be quite rare. Ideally when this occurs the cluster state should be recovered from a backup. This however is not always possible and can require unacceptable amounts of data loss. As cockroach adoption increases we need to continue to improve the story around unexpected disaster recovery; not only will it sometimes happen but also customers expect there to be good answers. Furthermore as deployments become larger and more system critical, providing an online mechanism to resolve

Describe the solution you'd like

There are two cases for recovering from unavailable ranges:

1) All replicas are lost

In this case we'd like to effectively re-create the range with no data. In principle it seems possible in this case to perform a transaction that overwrites the meta range entry for the range in question with the desired range descriptor (which includes a quorum of live nodes) and then to synthesize as snapshot and send it to one of those nodes. This case seems relatively straightforward. Furthermore this technique could be useful to enable feature which benefit from operating with a replication factor of 1.

2) Some replicas remain

In this case we probably want to use the replica with the highest committed raft log entry as the source of truth (do we want to commit uncommitted raft log entries?). We could imagine stepping a config change in-memory on that replica which removes all of the other replicas and then executing an AdminChangeReplicas command with the old range descriptor to up-replicate. This should "just work". The change replicas protocol performs a CPut against the range-local copy of the descriptor and does a blind put against the meta range. It seems possible to remove dead replicas by stepping, in-memory, a configuration change which moves the range into a state where the surviving replica.

Open issues

Splits and Merges

If splits or merges have occurred (I'm not going to work through those cases here but suffice it to say they're detectable, though they may or may not be easy to deal with).

Committed Transactions

If there committed transactions which have had some of their intents resolved. In this case we risk making the database inconsistent. It would require a database scan to discover any such intents and its not clear what to do if such intents exist.

In use 1x replication use cases like import this concern may be able to ensure that there are no transaction records which could have been lost.

Others I'm sure I missed other challenges

Describe alternatives you've considered

Today's approach to this problem require customers to shut down their cluster and then manually re-write the range-local descriptors. Then, upon restart, the meta range will be momentarily inconsistent until the next change to any modified range.

There's something nice about knowing the cluster is not running while performing unsafe operations.

Jira issue: CRDB-6340

Epic CRDB-14205

tbg commented 5 years ago

PS:

providing an online mechanism to resolve

Finish that thought :-)

re 1: maybe this is what you meant, but I think we'd just set it to contain just a single replica (which we make sure is on a live node), and for which we assign a new ReplicaID.

re 2: I don't think it'd make a measurable difference in practice, but we should default to picking a survivor that could win an election among the survivors. This means it's going to be closer to "longest log" and less about the committed index. To artificially "crown" the leader I was thinking we'd do basically what you suggest but in a more nuanced way. The code around membership changes has lots of assertions to make sure that the descriptor always matches the active config, and that config changes are always compatible with the config they replace. This would all blow up, so I think we want to teach RawNode about recovery. My strawman is

func RecoverThisReplica(r *Replica, designatedSurvivor ReplicaID, deadReplicas ...ReplicaID)
r.withRaftGroupLocked(rn *RawNode) {
  cs := raft.ConfigOverride{designatedSurvivor, deadReplicas}
  rn.OverrideConfig(cs)
})

ConfigOverride would reach into rn.raft.prs and set a new (pointer) field here:

https://github.com/cockroachdb/vendored/blob/3a89d360b6ffed361211b1ca493a66e22bb3c134/go.etcd.io/etcd/raft/tracker/tracker.go#L117-L118

which is nil in normal operation and resets to nil when the config changes into something that has quorum (assuming all replicas except deadReplicas are available). While the field is set, it makes the config used for decision making that represented by pb.ConfState{Voters: []uint64{designatedSurvivor}}. (This should only be needed in a small number of places).

For example, in a seven node group r1/1-7 with five nodes down and survivors r1/1 and r1/2, we might designate r1/2 as the survivor (perhaps it has the longer log) and call ConfigOverride on its replica. Once this replica campaigns for the next time (it should skip pre-vote in this mode for simplicity) it will instantly declare itself leader. The replicate queue on that replica can then get a lease, and might perform the following replication changes:

add r1/8 (override stays in place: quorum is 5/8 but only 3 live)
remove r1/1 (stays in place, 4/7 but 3 live)
remove r1/2 (stays in place, 4/6 but 3 live)
atomic change replacing r1/3 with r1/9 (disables override, 4/6 but 4 live).
group continues recovery but without any override active

What I like about this approach is (well, other than that I think that it solves a problem we need to solve) that most of the logic lives in raft, where very good testability exists via testdata-type testing.

tbg commented 4 years ago

The thinking here has since involved since a real-world outage has forced us to come up with a one-time use solution to "reset" a range all of whose replicas was lost: https://github.com/cockroachdb/cockroach/pull/50387

The approach there was synthesizing a snapshot that was sent to a store, resulting in the creation of an empty 1-replicated replica, which could then upreplicate. I prototyped that the same approach can be used to "resuscitate" a surviving replica to let it make progress again in #50268. Roughly speaking, from the surviving replicas we pick the one with the highest applied index N, and send to it a "patch snapshot" that will be applied on top of the existing state. That snapshot essentially corresponds to applying an extra log entry that changes the range descriptor to contain only the surviving replicas as voters, so when that snapshot is applied, replication will resume and the range can make progress again.

It would be nice if instead of the largest applied index we could chose the replica with the longest log (i.e. highest last index), but this requires an extra step. We would first have to lie to the replica and tell it that the last index is committed, wait until the corresponding log entry is applied, and then do what we did before (synth a snapshot at lastindex+1=appliedindex+1). We have to add this extra step because we need to have the latest value of the range descriptor if we hope to patch it, which is only the case if the replica has fully applied its log. The replica won't do this on its own because the tail of the log may either not be committed, or not known to be committed on this replica. I am confident that it's not worth going through this extra trouble about the longest log; replicas apply their entries essentially as fast as they can, meaning that in practice the highest applied index is found on the replica with the longest log.

tbg commented 4 years ago

A gotcha is that even if we do recover, we're not in the clear. Obviously user transactions can be corrupted (if a transaction commits with a parallel commit, and we lose an intent during the recovery, we're losing writes), but the real trouble starts if we lose writes on the range descriptor or meta ranges.

For the range descriptor, concretely the problem is if the recovery undoes a split or merge. For example:

r1 commits a split (creating r2) which applies on replicas 1,2 but not 3
r2 starts doing some stuff
r1/1 and r1/2 go down
recover from r1/3 - lose the split, we now have an r1 that covers the original span
have r1 and r2 overlapping.

or:

r1 commits merge (subsuming r2), but r1/3 doesn't know about it
r1/1 and r1/2 go down, recover from r1/3
we have lost the data in the subsumed range and have a hole in the keyspace (since r1 is now the pre-merge range)

I think that if we have access to the meta2 ranges (i.e. if they're online and weren't corrupted) we can recover from these two as well: Before recovering a replica, we compare its descriptor to the meta2 copy. If the span agrees (does the generation also have to agree? thinking about split-merge cycles) we can go ahead and recover. If it disagrees, there are two cases:

the meta2 shows a split happened, but the replica does not have it: we have to emulate the split. The easiest way is to make the snapshot a full snapshot - load the data from the surviving range, and make a snapshot that replicas the replica's state.
merge: this is trickier - we are lost unless we are colocated with the subsumed range. Luckily, this is guaranteed by this code: https://github.com/cockroachdb/cockroach/blob/a578719850e3026bb7b09a345c31f33229f7eee4/pkg/kv/kvserver/replica_command.go#L740-L751 where basically we are making sure that the subsumed range is in the right place for the merge and is guaranteed to not move away (lest the merge aborts). So we do the same thing as for the splits, i.e. make a full snapshot reading from the local disk.

The fact that in these cases we want to create the snapshot using data from the surviving replica suggests that we should push the snapshot generation into the applySnapshot path on the designated survivor. That way we can create the SSTs for this snapshot in-place, without sending too much data over the wire.

tbg commented 4 years ago

If we instead lose a metaX range, we have a different problem. If we roll back meta1, we have to check its integrity and possibly rebuild it from the actual meta2 pairs, which can be done. If we roll back a meta2 range, we have to go through the cluster and look at the actual ranges and make sure that the rolled back meta2 range reflects the actual ranges in the cluster. This too can be done. Of course a combination of losing both a range and its corresponding meta2 entry will be more difficult to recover from.

Either way, I think there are diminishing returns here and while we, in theory, can always patch the cluster back to some "functional" (though possibly inconsistent) state, it will be less and less desirable to do so. An integrity check for the meta ranges is a good idea regardless. Past that, we're probably better off investing in the resilience of these system ranges through more advanced replication patterns.

One extra related thought that I wanted to leave here is that the meta1/2 ranges are not actually required for correctness - they are a performance optimization. DistSender can in principle reach out to all nodes in the cluster to route from span to range. This would be a terrible choice for serving any kind of production traffic, but it can be good enough to allow for a backup to be taken when meta ranges are unavailable.

tbg commented 4 years ago

As an aside: when a cluster has had this tool invoked on it, we need to durably persist that fact and make sure it shows up prominently in debug zip.

tbg commented 3 years ago

Reopening, as we intend to improve this tool to also use the data in surviving replicas.

tbg commented 3 years ago

Some more color on how this feature is important in ways that we didn't perhaps originally appreciate.

When a range is unavailable, this can lead to unresolvable deadlocks. This can happen for two reasons:

a txn is anchored to the unavailable range. This txn can neither commit nor abort without writing to the unavailable range.
a txn has an inflight intent from parallel commits on an unavailable range. This is very similar to the above point: parallel commits in effect disperses the status of the transaction across multiple ranges to avoid round-trips. The intent on the unavailable range needs to be either proven to exist or proven to never exist to commit or abort the transaction.

There's also a secondary issue, which is that keeping the above txns open causes cascading contention throughout the system. Solutions or improvements on that issue are tracked in https://github.com/cockroachdb/cockroach/issues/33007.

We need tools to restore service quickly by discovering and resolving transactions which are forced open due to an unavailable range. There are the following subcases for the txn record case:

when all replicas are lost, we are talking about https://github.com/cockroachdb/cockroach/issues/41411 for which we have a partially productionized solution. By recreating the range, the transaction will be considered aborted (due to its txn record missing). This can lead to dirty writes: if the txn in fact had committed, some of its intents may already have been resolved. Other intents may become deleted. It's interesting to note that if we stored txnIDs in committed KV pairs, we could in principle do a full KV scan and determine whether any of the KV pairs are already committed. If so, we would recover the range with a synthesized committed txn record. However - this is very complex. It's also worth noting that recreating a range as empty can break all kinds of other invariants and is extra difficult if the data loss occurs on a meta range and/or contains splits and merges.
when at least one replica survives, we can potentially recover a txn record (if not, we're practically in the above case). Our strategy would typically be recovering the range from the most up to date surviving replica in the hope that the txn record resulting from that is authoritative. If it is, there is no data loss. If it is not, then dirty writes (we abort some writes, but they were committed) could result.

Similar considerations apply to intents deciding the fate of a parallel commit. Harm is only done if a TxnCoordSender observed the parallel commit as succeeding, at which point it sends a committing EndTxn which may resolve intents. If we then wind back the txn to aborted, we may get dirty writes. Note that this would become more likely if we start including txnIDs in committed values at which point the TxnCoordSender could start resolving intents without first explicitly committing the txn (cc @nvanbenschoten isn't there an issue? I couldn't find one).

Looking at all of the above, it seems that as a first pass, we need good tooling (i.e. this issue) to simply get a range back to life. This will often mean accepting that a transactional anomaly may have been introduced, but not always! There are cases in which a range is unavailable but no data is lost, see the issue triggering https://github.com/cockroachdb/cockroach/issues/60612.

nvanbenschoten commented 3 years ago

Harm is only done if a TxnCoordSender observed the parallel commit as succeeding, at which point it sends a committing EndTxn which may resolve intents. If we then wind back the txn to aborted, we may get dirty writes.

This is an interesting point. You're saying that we don't resolve intents (today) until after a transaction has been explicitly committed. This means that we can get out of the fragile implicit commit state relatively unscathed by skipping the distributed txn recovery protocol and moving the transaction record from STAGING to ABORTED. This would not result in dirty writes and would only result in a "durability violation", where a transaction commit was acknowledged but then not upheld. That doesn't seem too bad if we're already in this state.

Note that this would become more likely if we start including txnIDs in committed values at which point the TxnCoordSender could start resolving intents without first explicitly committing the txn (cc @nvanbenschoten isn't there an issue? I couldn't find one).

The only thing I'm aware of is a reference to the "extended protocol" in the original parallel commits RFC. IIRC, you had laid out the full thing in an early iteration of the RFC, but then scoped the project down to avoid needing a separate proposal about storing txnIDs in committed values.

andreimatei commented 3 years ago

A recent development that deserves a mention here is the Raft-based closed timestamps - which mean that we're now recording information about the closed timestamps of different ranges/replicas. This opens the door to doing cluster-wide recovery by rolling back all the ranges to the lower bound of the closed timestamps across all the replicas that we still have access to. For example, if you have range 1 with replicas a,b,c and range 2 with replicas d,e,f, you consider the closed timestamp for r1 to be ct1=max(a,b,c), the closed timestamp for r2 to be ct2=max(d,e,f), and the cluster-wide closed timestamp min(ct1,ct2). Then, I think you need to backtrack this timestamp below any intent that's still present in the cluster whose txn status cannot be determined. And then you discard all data with higher timestamps, across the cluster.

tbg commented 3 years ago

Thinking about this again.

I think the main complexity when recovering vanilla user ranges comes from dealing with edge cases related to splits and merges. If a range recently split, and then lost quorum with only a "slow" follower suriving, we have to consider all of the following:

the replica may reflect the pre-split (i.e. wider) state
yet, the RHS does exist. Or maybe it doesn't, in case the split didn't apply. We can't know whether it did, as this follower may not even have the intent on the range descriptor, but it definitely does not have the committed intent applied (it could have that in its raft log, or not).
meta2 may have an intent that relates to the split. If the split committed, the intent would need to be committed for sure, as the replicas for the RHS are around (somewhere in the cluster, unless they too have been lost...) but again, we don't know and would need to make an educated guess (i.e. by contacting the nodes listed in the intent and seeing whether the replicas are there).
but the intent is abortable the moment we recover the LHS replica, unless we also synthesize a suitable committed transaction record when bringing the LHS back to life
depending on the committed status, we also can't just bring the LHS back to life, we need to suitably truncate it, which means simulating a split... or at least parts of a split.

If this isn't unsavory enough yet, consider that the slow follower may have missed multiple splits, rebalances, merges, etc. I think this is all prohibitively complex. And yet, we want a tool that can reliably get the user out of certain kinds of pickle.

Splits have a history of causing complications, look no further than this code

https://github.com/cockroachdb/cockroach/blob/295338b6073c642751f656eccce93b08d55bf856/pkg/kv/kvserver/replica_command.go#L390-L396

or this code

https://github.com/cockroachdb/cockroach/blob/6e4c3ae05d92f4104b7b9fbb33f9d1cecf055fd3/pkg/kv/kvserver/split_trigger_helper.go#L49-L59

so maybe instead of trying to fully solve for restoring quorum across splits, we can find ways to make sure we will "realistically" never end up in that situation when needing a recovery. I think this could be achieved by committing, irrevocably, the plan to carry out this split and making sure that all followers have durably learned of that plan. That way, regardless of which follower survives, it will know of the plan and must execute on it. But even that is fraught with peril; when the restored follower executes on the plan, I think it must not create a right-hand side replica. This is because if the plan had previously been executed by the majority, they may have created right-hand side replicas already, and these right-hand side replicas may have picked up writes from the LHS that are missing from our lagging follower LHS. We would thus run the risk of seeding a late RHS that isn't byte-by-byte comparable at the initial log entry 10, and thus wouldn't be comparable later. This means the consistency checker would eventually detect this.

Something that has a better chance of working is to make sure the split never gets applied in the first place, unless it is in all followers logs (in a way that makes it impossible or at least very unlikely that it will be replaced), i.e. requiring that the split commit is replicated with a quorum policy that requires participation of all replicas. I think this solves all of the above problems, as it makes sure the split happens deterministically at a fixed log position, no matter the survivor (and without a chance for writes to sneak in that would "adulterate" the right-hand side). But it creates a worse problem: we're now looking at a loss of quorum scenario whenever any follower goes down in the middle of a split. So perhaps instead the way to go is an "advisory" hint to Raft to try to persist this command everywhere (but only within reason, i.e. to followers that are known to be around and in good health); or to replicate it "as usual" but delay the surfacing of committed commands to the apply loop until all reasonably present followers have acked the corresponding append.

There are yet more invasive options, like separating the cluster membership system from the per-range replication system so that it becomes "fine" to require all replicas to participate in a commit, by making it cheap and very highly available to change membership. (In our current model, the range membership is owned by the range itself and done through consistent transactions that aren't local to the range, so everything is complicated).

tbg commented 3 years ago

As an afterthought, it seems that we need to get something working that doesn't handle this "split rollback" scenario first, as we'll otherwise be without a tool for quite a while.

nvanbenschoten commented 3 years ago

@tbg reading through your comment, it sounds like there's a baked-in assumption that we want to roll splits/merges forward. Is that correct? I've been thinking about this problem a little differently. Just like we intend to roll back all MVCC state to the lowest common denominator across ranges to establish a consistent prefix, I was thinking that we would roll back incoherent replication changes to the lowest common denominator. Range descriptor generations give us a total order here for ranges with overlapping key bounds, so given two overlapping ranges, we can always determine who is staler.

I'm sure you've thought through approaches along these lines as well. What problems do they run into? What problems do they avoid?

tbg commented 3 years ago

There are different things we could do lumped into one issue. I was mostly thinking through a "straightforward" way to bring a single range back (because maybe that's all you've lost), because anecdotally that's what everyone needed so far. And I am pointing out that this already gets surprisingly difficult - it's not trivial to undo a split, or anything on a range really, because this might always entail corrupting the global internal cluster state in ways that are hard to reliably fix. And working reliably is important here, and consequently I'm looking for things we can do that are "obviously" correct, i.e. anything that looks like open heart surgery whack-a-mole is not something I am keen on considering. So for simply restoring quorum to a single range (with possible data anomalies) I think we need to require that meta2 is available, check for the intent or committed value that would be left by a split and if one is there, refuse to recover (or at least require an additional override that we can use in case we have verified that the RHS isn't around; this could perhaps also be automated, but that's getting a bit into diminishing returns I think).

When we recover an entire cluster to an up-to-date consistent state, I think there's really conceptually only a single way to go about it, which is the one you mentioned, picking a cluster-wide resolved timestamp and reverting to that. But here too, I think it will be too difficult to do this on the running cluster, or even "to the same cluster". We also have non-MVCC data that we would need to worry about. All in all, and I haven't thought about this deeply, I gravitate here towards an "in-place bootstrap" approach, where we make a new cluster with the same user data, roughly

bring cluster down (or if we want it to stay up to serve some historical reads, create a checkpoint on each node and operate on that)
on each node (that is left), run something like ./cockroach debug recovery-digest which dumps roughly a range status report, i.e. all range descriptors, their resolved timestamp and applied index, etc.
all of these get pulled to a central location and ./cockroach debug recovery-plan gets run on all of these files, this results in a recovery plan for each node
./cockroach debug recovery-bootstrap <input_file_for_node> initializes the node as part of a (newly created) cluster with the same split points, using existing on-disk data (rolled back via in-place RevertRange and ClearRange). One node is chosen to also bootstrap the CRDB system ranges (and liveness records for the other nodes); i.e. we're generalizing the notion of bootstrap here.

It doesn't have to be as offline as that, but I think trying to repurpose the ailing existing cluster for any of this stuff goes back to "open heart surgery" territory; doing it offline ensures that nothing can block or get stuck and widens the scope in which the tool can apply to be the widest possible.

mwang1026 commented 3 years ago

I think we need to require that meta2 is available, check for the intent or committed value that would be left by a split and if one is there, refuse to recover (or at least require an additional override that we can use in case we have verified that the RHS isn't around; this could perhaps also be automated, but that's getting a bit into diminishing returns I think).

I think it's OK to require meta2 to be available. re: refuse to recover that is the part that I think could be frustrating to users. I think it's OK to require some manual checking if RHS is available, but would have to run thru the UX to see.

When we recover an entire cluster to an up-to-date consistent state, I think there's really conceptually only a single way to go about it, which is the one you mentioned, picking a cluster-wide resolved timestamp and reverting to that.

I think we can treat this as a separate problem with separate scope and separate prioritization. I agree this would likely need to be / safer to be offline / some special mode.

tbg commented 3 years ago

erikgrinaker commented 1 year ago

A "half-online" loss of quorum recovery variant is shipping in 23.1 (see RFC). This only requires a rolling restart of the affected nodes (i.e. the ones with remaining replicas), which I think is good enough in most cases.