Open sumeerbhola opened 3 years ago
If we build a quarantining / mark-unavailable tool, we'll want a way to get a range out of that state too. That is probably obvious, but I wanted to make a note here so that we don't forget.
With serverless coming, the priority of this kind of thing goes up. We need to avoid host-cluster-wide outages, or at least keep them very short. A panic triggered by a raft log entry would cause one (granted, not host-cluster-wide, but it would affect multiple customers with no clear mitigation path).
@joshimhoff I think what you really need is #33007; I am skeptical of the need for this quarantine tool in the short term (i.e. the next release) once we have the circuit breaker. The original way we got into this issue at the customer is no longer possible, and to even get into that state they had to muck around with esoteric cluster settings, which we are unlikely to do in the cloud unless something bigger is happening.
TL;DR: I think we ultimately want an offline tool that zeroes out a raft entry for a given rangeID on a store, and to rely on an improved unsafe-remove-dead-replicas for everything else. But step one is making replicas unavailable instead of crashing on apply-loop failures, and working on the circuit breakers (#33007).
Longer circuitous train of thought follows:
If a raft command reliably panics the node on apply, we have a problem. We can mitigate that to some degree by appropriate use of defer recover() and by stalling the range (and we should do that as part of closing this issue). Together with the circuit breakers (#33007), this preserves availability as much as can reasonably be expected. But if a raft command tries to allocate 10000TB of memory, the runtime will still crash the process, and either way we need a mitigation once we've gotten to the state in which a node is either down, or up but with that range stalled.
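For illustration, here is a minimal, self-contained Go sketch of the "defer recover() and stall the range" idea; this is not CockroachDB's actual apply code, and replica, entry, and the stalled flag are made-up names. The point is just that the panic is converted into an error, the replica is marked stalled, and the process stays up.

```go
// Sketch only: converting an apply-time panic into a stalled replica instead
// of a process crash. Names here are hypothetical, not CockroachDB's.
package main

import (
	"errors"
	"fmt"
)

type entry struct {
	index uint64
	data  []byte
}

type replica struct {
	rangeID int64
	stalled bool // set when apply panics; further ops fail fast
}

// applyEntry applies a single log entry, converting a panic into an error and
// stalling the replica so the node stays up and other ranges stay available.
func (r *replica) applyEntry(e entry) (err error) {
	defer func() {
		if p := recover(); p != nil {
			r.stalled = true
			err = fmt.Errorf("r%d: apply of index %d panicked: %v", r.rangeID, e.index, p)
		}
	}()
	if r.stalled {
		return errors.New("replica is stalled; refusing to apply")
	}
	return r.applyToStateMachine(e)
}

func (r *replica) applyToStateMachine(e entry) error {
	if len(e.data) == 0 {
		panic("malformed command") // stand-in for a deterministic apply bug
	}
	return nil
}

func main() {
	r := &replica{rangeID: 1}
	fmt.Println(r.applyEntry(entry{index: 10, data: nil}))         // panic converted to error
	fmt.Println(r.applyEntry(entry{index: 11, data: []byte("x")})) // refused: replica stalled
}
```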
In https://github.com/cockroachdb/cockroach/issues/70860 I suggest an offline tool that replaces a given raft log entry with an empty entry. That seemed like an appropriate first step (but keep reading) and provides a way out if the panic is completely deterministic: then we know that no replica will successfully apply whatever the command was trying to do to the state machine. We can thus use the tool to replace the log entry with a benign one (either completely offline or in rolling-restart fashion).
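A minimal sketch of what such an offline tool could look like, assuming the store is a Pebble instance and the node is stopped. raftLogKey and the on-disk entry encoding here are simplified placeholders, not CockroachDB's real raft log key/value encoding.

```go
// Sketch of an offline "zero out a raft log entry" tool. The key encoding and
// value format are simplified stand-ins for CockroachDB's real ones.
package main

import (
	"encoding/binary"
	"log"

	"github.com/cockroachdb/pebble"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

// raftLogKey is a placeholder for the real raft log key encoding.
func raftLogKey(rangeID, index uint64) []byte {
	k := make([]byte, 0, 24)
	k = append(k, []byte("raftlog/")...)
	k = binary.BigEndian.AppendUint64(k, rangeID)
	k = binary.BigEndian.AppendUint64(k, index)
	return k
}

// zeroOutEntry overwrites the entry at (rangeID, index) with an empty entry of
// the same term and index, so the replica applies a no-op at that position.
func zeroOutEntry(db *pebble.DB, rangeID, index, term uint64) error {
	empty := raftpb.Entry{Term: term, Index: index, Type: raftpb.EntryNormal}
	val, err := empty.Marshal()
	if err != nil {
		return err
	}
	return db.Set(raftLogKey(rangeID, index), val, pebble.Sync)
}

func main() {
	db, err := pebble.Open("/path/to/store", &pebble.Options{}) // node must be stopped
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := zeroOutEntry(db, 1 /* rangeID */, 42 /* index */, 7 /* term */); err != nil {
		log.Fatal(err)
	}
}
```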
Where it gets tricky is if the panic is only "approximately" deterministic. Say two out of three nodes have their replica of r1 stalled, but the third node managed to push through for some reason (maybe it got lucky). Then we don't want to replace the entry in the first two nodes' logs, or the replica consistency checker will get us later. What we'd want to do in this case is truncate the log (on the two nodes) such that the offending entry is gone. The raft leader (the node that managed to apply the entry) can then replicate snapshots to them and things will be back to normal.
So we need two offline tools: one that replaces a given raft log entry with an empty (no-op) entry, and one that truncates a replica's raft log so that a given entry is removed.
Neither are terribly difficult to write.
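For concreteness, here is a sketch of the second tool: offline truncation of a stalled replica's raft log through the offending index, after which the leader backfills the replica via snapshot. As above, raftLogKey is a placeholder for the real key encoding, and a real tool would also have to rewrite the replica's RaftTruncatedState, which is elided here.

```go
// Sketch of an offline raft log truncation tool. Key encoding is a simplified
// stand-in; updating RaftTruncatedState is deliberately omitted.
package main

import (
	"encoding/binary"
	"log"

	"github.com/cockroachdb/pebble"
)

// raftLogKey is a placeholder for the real raft log key encoding.
func raftLogKey(rangeID, index uint64) []byte {
	k := []byte("raftlog/")
	k = binary.BigEndian.AppendUint64(k, rangeID)
	k = binary.BigEndian.AppendUint64(k, index)
	return k
}

// truncateLogThrough removes all of the range's log entries up to and
// including `through`, so the offending entry can never again be applied from
// this replica's own log.
func truncateLogThrough(db *pebble.DB, rangeID, through uint64) error {
	return db.DeleteRange(
		raftLogKey(rangeID, 0),
		raftLogKey(rangeID, through+1), // end key is exclusive
		pebble.Sync,
	)
}

func main() {
	db, err := pebble.Open("/path/to/store", &pebble.Options{}) // node must be stopped
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := truncateLogThrough(db, 1 /* rangeID */, 42 /* offending index */); err != nil {
		log.Fatal(err)
	}
}
```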
We'd use them as follows: if the panic is fully deterministic and every replica is stalled, zero out the offending entry on all replicas; if some replica managed to apply the entry, truncate the log on the stalled replicas instead and let the survivor bring them back up to date via snapshots.
We cannot get away with using the truncate tool for the "all replicas stalled/crashed" case, unfortunately - at least not easily. The basic problem is that we need the state machine to have its AppliedState at or above the problematic index, but this won't be true (due to the crash). We could say we can move it up by one as part of the recovery, but this is fraught as well: the applied index may lag the problematic command (since application of commands isn't synced), so we'd also have to teach the tool how to apply raft commands, which is unrealistic.
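The invariant in question, as a tiny sketch (field names are illustrative): truncating the log through an index is only safe if the state machine has already applied at least that index.

```go
// Sketch of the precondition a truncation tool would have to check.
package main

import "fmt"

type replicaState struct {
	appliedIndex uint64 // raft applied index of the state machine
}

// canTruncateThrough reports whether removing log entries <= index would leave
// the state machine missing commands it never applied.
func canTruncateThrough(s replicaState, index uint64) error {
	if s.appliedIndex < index {
		return fmt.Errorf("applied index %d < truncation index %d: state machine would be missing commands",
			s.appliedIndex, index)
	}
	return nil
}

func main() {
	fmt.Println(canTruncateThrough(replicaState{appliedIndex: 40}, 42)) // error: the crash case
	fmt.Println(canTruncateThrough(replicaState{appliedIndex: 45}, 42)) // <nil>
}
```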
However, there's also the opportunity to replace the "offline raft log truncation" tool with (an improved) cockroach debug unsafe-remove-dead-replicas (see https://github.com/cockroachdb/cockroach/pull/70623): if all followers crash on the entry, we need to use the log-overwrite tool and are then done. Otherwise, we use unsafe-remove-dead-replicas to selectively restore quorum from the nodes that did manage to apply the command. Note that this assumes unsafe-remove-dead-replicas will actually remove the non-chosen survivors (i.e. the replicas that didn't manage to apply the log entry). If it didn't do that, we'd be in danger of some of them joining up to form a quorum and putting the range in a split-brain state; also, they would continue to crash. Of course, if there is no survivor we can pick, we need to zero out the log entry on all replicas.
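To spell out that decision procedure as a sketch (the replicaInfo type and the returned action strings are illustrative, not real tooling):

```go
// Sketch encoding the recovery decision from the comment above.
package main

import "fmt"

type replicaInfo struct {
	storeID       int
	appliedBeyond bool // true if this replica applied past the offending entry
}

// chooseRecovery mirrors the comment: if no replica got past the entry, zero
// it out everywhere; otherwise restore quorum from a replica that did apply it
// and remove the ones that did not (they would keep crashing and could
// otherwise regroup into a split-brained quorum).
func chooseRecovery(replicas []replicaInfo) string {
	var survivors, stalled []int
	for _, r := range replicas {
		if r.appliedBeyond {
			survivors = append(survivors, r.storeID)
		} else {
			stalled = append(stalled, r.storeID)
		}
	}
	if len(survivors) == 0 {
		return fmt.Sprintf("zero out the log entry on all replicas (stores %v)", stalled)
	}
	return fmt.Sprintf("unsafe-remove-dead-replicas: keep store %d, remove stores %v",
		survivors[0], stalled)
}

func main() {
	fmt.Println(chooseRecovery([]replicaInfo{{1, false}, {2, false}, {3, false}}))
	fmt.Println(chooseRecovery([]replicaInfo{{1, false}, {2, false}, {3, true}}))
}
```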
This all makes sense to me.
In the longer term, as CC serverless grows in scale & reliability requirements, I think the simplicity of saying "this range is now unavailable" via cockroach cordon range_id
or similar is powerful. With such a tool, an SRE (instead of a KV dev) can look at a panic tied to the application of some entry on some range & immediately mitigate in a way that is quite safe, since the raft log is not messed with (something SREs should clearly not do without KV help). After that, the SRE can escalate to a KV dev to mess with the raft log and resolve the outage. One serverless customer down is much preferable to many down, and an SRE will also be able to run such a command before a KV engineer can, given that (i) the SRE receives the page first & (ii) the KV L2 has some non-zero time to keyboard.
But if a raft command tries to allocate 10000TB of memory, the runtime will still crash the process, and either way, we need a mitigation once we've gotten to the state in which a node is either down, or up but with that range stalled.
Yes agreed.
re: cockroach cordon
what would you expect the behavior to be? that it would circuit break? and how might one "uncordon" that range?
re: cockroach cordon what would you expect the behavior to be?
Yes, we'd immediately return errors when ops are made on the range, as with circuit breaking (IIUC how circuit breaking works). We'd also pause async machinery such as the raft apply loop, hence the ability to mitigate stuff like:
But if a raft command tries to allocate 10000TB of memory, the runtime will still crash the process
and how might one "uncordon" that range?
cockroach uncordon
or some similar CLI interface?
Here is a related POC: https://github.com/cockroachdb/cockroach/pull/78092. In the POC, the capability is called "cordoning", and it applies to a replica rather than a whole range. In the POC, cordoning also happens automatically in the face of a panic below raft (at apply time). There is no CLI tool to trigger it manually, but that could be changed.
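For what it's worth, here is a minimal sketch of the fail-fast behavior discussed above for a hypothetical cockroach cordon / cockroach uncordon pair. All names here are made up and this is not the POC's actual code; it shows only the request-path side (pausing the apply loop for cordoned replicas is omitted).

```go
// Sketch: requests to a cordoned range fail fast instead of queueing.
package main

import (
	"fmt"
	"sync"
)

type cordonRegistry struct {
	mu       sync.Mutex
	cordoned map[int64]bool
}

func (c *cordonRegistry) set(rangeID int64, on bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cordoned[rangeID] = on
}

func (c *cordonRegistry) isCordoned(rangeID int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cordoned[rangeID]
}

// send is a stand-in for the per-range request path: cordoned ranges return an
// error immediately, which is what lets an SRE mitigate without touching the
// raft log.
func send(reg *cordonRegistry, rangeID int64, req string) error {
	if reg.isCordoned(rangeID) {
		return fmt.Errorf("r%d is cordoned; %q rejected", rangeID, req)
	}
	return nil // evaluate the request as usual
}

func main() {
	reg := &cordonRegistry{cordoned: map[int64]bool{}}
	reg.set(17, true)                    // hypothetical: cockroach cordon 17
	fmt.Println(send(reg, 17, "Get(k)")) // fails fast
	reg.set(17, false)                   // hypothetical: cockroach uncordon 17
	fmt.Println(send(reg, 17, "Get(k)")) // <nil>
}
```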
This came up in the postmortem for internal issue https://cockroachdb.zendesk.com/agent/tickets/21124, where only a handful of ranges were affected but the loss-of-quorum (LoQ) recovery tool had to be run on thousands of ranges since some nodes were crash-looping.
This came up in the context of a postmortem where reading an extremely large value in the raft log for a range was causing a panic and node crashes. In some cases such crashing can spread, if different nodes get picked to host the range (and the crashing does not happen in the act of generating the range snapshot itself). Even if it does not spread, repeatedly crashing all the nodes that host replicas of a range can cause extreme degradation of the cluster due to the unavailability of other ranges hosted by those nodes, or, for small clusters, make most of the nodes unavailable.
The concept of range quarantining exists in other distributed range-partitioned systems. In the CockroachDB context this would extend down to the raft log, to prevent any reads/writes to the log. Although other systems have support for automated quarantining, we would start with manual quarantining, where a human could take into account the other consequences of range data unavailability. Any queries reading/writing to that range would immediately return an error and the corresponding transactions would fail.
Quarantining is not straightforward in the CockroachDB context: unlike a database that disaggregates storage for a range into a distributed file system, such that data availability can be maintained independently of the database, turning down node(s) that host replicas of the range decreases the number of copies and can in the extreme result in loss of all data for that range. So we would want to keep track of the current replication level for a range, and which replicas are up-to-date with respect to having all the committed entries in the raft log, and warn admins against turning down all such up-to-date nodes. Also implied in the above is that rebalancing of quarantined ranges is not possible -- this means disk-space pressure on a node will need to be alleviated by moving one of the non-quarantined ranges.
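A sketch of the kind of safety check described above (the types and the notion of "up-to-date" here are illustrative): before turning down a set of nodes, warn if doing so would remove every up-to-date replica of a quarantined range.

```go
// Sketch: warn before turning down stores that hold the last up-to-date
// replicas of a quarantined range.
package main

import "fmt"

type quarantinedRange struct {
	rangeID        int64
	upToDateStores []int // stores whose replicas have all committed log entries
}

// checkTurnDown returns a warning for each quarantined range that would be
// left with no up-to-date copy of its data.
func checkTurnDown(ranges []quarantinedRange, storesToRemove map[int]bool) []string {
	var warnings []string
	for _, r := range ranges {
		remaining := 0
		for _, s := range r.upToDateStores {
			if !storesToRemove[s] {
				remaining++
			}
		}
		if remaining == 0 {
			warnings = append(warnings,
				fmt.Sprintf("r%d: turning down these stores removes every up-to-date replica", r.rangeID))
		}
	}
	return warnings
}

func main() {
	rs := []quarantinedRange{{rangeID: 9, upToDateStores: []int{1, 2}}}
	fmt.Println(checkTurnDown(rs, map[int]bool{1: true, 2: true})) // warns
	fmt.Println(checkTurnDown(rs, map[int]bool{1: true}))          // ok: store 2 remains
}
```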
cc @andreimatei @lunevalex
Jira issue: CRDB-2825