
kvserver: reduce blast radius of Raft application errors #75944

Open joshimhoff opened 2 years ago

joshimhoff commented 2 years ago

Is your feature request related to a problem? Please describe. Non-deterministic failures to apply a raft command can crash the node:

This poses a problem for replicas that fail for any reason to apply an entry. If the failure wasn't deterministic across all replicas then they can't carry on applying entries, as their state may have diverged from their peers. The only reasonable recourse is to signal that the replica has become corrupted. This demonstrates why it is necessary to separate deterministic command failures from non-deterministic state transition failures. The former, which we call "command rejection", is permissible as long as all replicas come to the same decision to reject the command and handle the rejection in the same way (e.g. decide not to make any state transition). The latter, on the other hand, is not permissible, and is typically handled by crashing the node.

If the non-deterministic failure to apply will actually happen repeatedly until an operator takes some action, we have a large-scale outage on our hands with high impact & MTTR. I won't say which customer, but a recent on-prem customer experienced just such an outage.

Describe the solution you'd like Re: outages of this sort, @erikgrinaker says at https://github.com/cockroachdb/cockroach/issues/75903:

Yeah, I think there's plenty of stuff to pull at here in terms of offline recovery tools, blast radii, testing, and general resilience.

Building on that, I see why we crash the node, but I don't see why the scale of the outage needs to be quite as large as it is today.

Can we add an exponential backoff or similar to the per-replica apply loop? That is, if application keeps failing, continue crashing & retrying as needed, but retry application with greater and greater sleep times between last application and retry?

Then the node will eventually stay up for a longer time, allowing it to serve other requests, in particular requests to ranges other than the one that has the command that cannot be applied.

This would especially be nice in a serverless context. In that context, we'd really like to keep outages from affecting many tenants at once. In the current state, an outage involving repeated (but non-deterministic) failure to apply some command post commit would affect all tenants with data hosted on the multiple crashing nodes. If we do something like what is proposed here, perhaps the impact would be mostly limited to the tenant whose range has the command that can't be applied.

Also, adding an exponential backoff sounds pretty straightforward, though maybe it's not, due to details I'm missing...

Describe alternatives you've considered

Additional context N/A

Jira issue: CRDB-12888

Epic CRDB-39898

erikgrinaker commented 2 years ago

There are two kinds of failures to consider: errors returned while applying a command, and panics raised during application. Both currently crash the node.

The recent escalation that triggered this was a panic. However, that was due to insufficient bounds/overflow checks in Pebble, and it should arguably have been an error instead. So this seems to be the path forward:

  1. Make sure the Raft application code path can safely handle errors while keeping its internal state consistent.
  2. Retry Raft application on errors with exponential backoff.
  3. Avoid panics as much as possible in code (including e.g. bounds/overflow checking), returning errors instead (see the sketch after this list).
  4. Accept that the occasional panic will crash the node, but aim to have 0 panics through e.g. better testing.
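
To make item 3 concrete, here is a minimal sketch of a bounds check that returns an error instead of panicking, so the apply path can reject or back off rather than crash the node. It assumes a hypothetical decodeLength helper; none of this is CockroachDB code.

package boundscheck

import (
    "encoding/binary"
    "fmt"
)

// decodeLength is a hypothetical decode helper used only for illustration.
// Without the explicit checks, a short or corrupt buffer would panic inside
// binary.BigEndian.Uint32 or a later slice operation and crash the node.
func decodeLength(buf []byte) (uint32, error) {
    if len(buf) < 4 {
        return 0, fmt.Errorf("buffer too short: %d bytes, need at least 4", len(buf))
    }
    n := binary.BigEndian.Uint32(buf)
    if uint64(n) > uint64(len(buf)-4) {
        return 0, fmt.Errorf("declared length %d exceeds remaining %d bytes", n, len(buf)-4)
    }
    return n, nil
}

The caller can then treat the error like any other application failure (back off, trip a breaker, etc.) instead of tearing down the process.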
joshimhoff commented 2 years ago

I think I'm suggesting a different approach. Keep crashing the node, regardless of whether it's a panic or an error. But when the node starts up again, don't retry the apply immediately, as we do today (IIUC), hence the crash loop experienced by the recent on-prem customer. Retry it with an exponential backoff policy governing the time between attempts.

Clearly, to determine how long to wait between apply attempts, we need some way to know, or at least approximate, how many attempts at doing some application have happened already. We can't use memory since the node will be restarting. Perhaps the time since the command was committed gives us such an approximation? Or perhaps we can write a counter to disk keeping track of the number of application attempts?
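
As a rough sketch of what the backoff computation could look like, assuming a counter of application attempts is persisted to disk across restarts (all names here, e.g. delayForAttempt, are hypothetical, not existing CockroachDB code):

package applybackoff

import "time"

const (
    baseDelay = 10 * time.Second
    maxDelay  = 30 * time.Minute
)

// delayForAttempt returns how long to wait before re-attempting application
// of the failing entry, given how many attempts have already been recorded
// on disk across restarts: 10s, 20s, 40s, ... capped at 30m.
func delayForAttempt(attempts int) time.Duration {
    d := baseDelay
    for i := 0; i < attempts && d < maxDelay; i++ {
        d *= 2
    }
    if d > maxDelay {
        d = maxDelay
    }
    return d
}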

I wonder if this idea is also not feasible for reasons I will soon understand!

But what is nice about it, if I do understand correctly, is that it reduces blast radius without requiring KV to ensure no inconsistent state is left around in case of errors, catch panics, or stop crashing, all of which sounds hard given the correctness details you were mentioning in the last ticket, and all of which needs to be done on a case-by-case basis.

petermattis commented 2 years ago

I'm extraordinarily nervous about anything which tries to recover a range from an unexpected condition and keep it available. I'd much rather take an approach suggested in your last message, @joshimhoff, which is to try and cordon off and isolate a range/replica that is continually causing a crash. Doing so would at least keep the cluster up, though it might not be usable as the range may be unavailable and part of a table that is critical to the health of the cluster or application. For example, consider what would happen if the unavailable range was holding system tables. The main advantage I see to isolating problematic ranges like this is that we'd hopefully have more online debugging tools to diagnose what is going on, such as the range debug pages.

joshimhoff commented 2 years ago

I'm extraordinarily nervous about anything which tries to recover a range from an unexpected condition and keep it available.

Makes sense!

The main advantage I see to isolating problematic ranges like this is that we'd hopefully have more online debugging tools to diagnose what is going on, such as the range debug pages.

Plus, as you say, there's some chance that the outage doesn't actually break the whole application. Considering how the CC console uses CRDB, one can imagine various single range outages that would leave the CC console degraded but still working pretty well.

And also with serverless, so long as the broken range is NOT a system range, we likely would have a KV cluster that works fine for all tenants other than the one whose range has the command that can't be applied. That's a big enough reduction in blast radius in an important CRL product to be motivating IMO.

erikgrinaker commented 2 years ago

We could combine the above error handling sketch with @tbg's new replica circuit breakers: if application errors we trip the circuit breaker (cordoning off the range), then periodically retry application and untrip the breaker if it succeeds. The breaker already trips on hung application. I don't think I'd extend that to panic recovery though, but we could if we wanted to.

tbg commented 2 years ago

then periodically retry application and untrip the breaker if it succeeds.

I wouldn't do this as a first pass. When an error results, we might already be in an inconsistent state. I'd stay close to what Peter suggested above: permanently trip the breaker for that replica until the node is restarted.

The story doesn't end there, though. If that node holds the lease, this makes the range unavailable, so it's desirable to be able to shed the lease. We probably don't want to let liveness expire - that's too heavy-handed, but if we had lease transfers that don't hop on the replication layer (i.e. an RPC to another node that tells it that we stopped using our lease as of timestamp XYZ and lets it steal the lease) then we could straightforwardly do it. I am sure we discussed such a mechanism before, but can't find the issue now. Might be worth filing separately but I'll see if anyone here remembers first.
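
For illustration only, a hypothetical sketch of what such an out-of-band lease handoff RPC might carry; none of these types exist in CockroachDB, and the fields are just a guess at what the recipient would need in order to safely steal the lease.

package leasehandoff

import "time"

// AbandonLeaseRequest is sent by the unhealthy leaseholder to a peer to say
// "I stopped using my lease as of AbandonedAt; feel free to take it".
type AbandonLeaseRequest struct {
    RangeID       int64
    LeaseSequence int64     // the lease the sender promises never to use again
    AbandonedAt   time.Time // no requests served under that lease at or after this time
}

// AbandonLeaseResponse acknowledges the handoff; the recipient may then
// acquire the lease without waiting for the sender's liveness to expire.
type AbandonLeaseResponse struct {
    Acquired bool
}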

tbg commented 2 years ago

Then there's the next layer to peel off - when there are lease preferences set, what stops the other nodes from sending the lease right back in case the unhealthy replica is the preferred leaseholder? It all comes down to a need to communicate per-replica health between nodes (or requiring opt-in from the recipient for any operation).

erikgrinaker commented 2 years ago

It's always seemed really unfortunate to me that a broken leaseholder replica won't passively lose the lease, since we tie the lease to node liveness rather than replica liveness (we have the same problem e.g. in the case of a replica deadlock). I don't know what the best solution is, but I think that's the problem we should solve rather than introducing side channels for active lease transfers.

tbg commented 2 years ago

Yes, I sympathize with that point of view. Before epoch-based leases (when all leases were expiration-based), we had this property. To limit the per-range overhead, we went and introduced epoch-based leases and are now finding the opposite, that more granularity would be desired. When you have an epoch-based lease, the assumption is that all replicas on that node are "healthy". But now we're finding this not to be true. What we want is almost like making the circuit breaker state for all replicas known to each node (eliding the healthy breakers). The naive solution, writing a []RangeID to the liveness record, is obviously not great (1MB uncompressed is around 125k RangeIDs, not bad but too big for KV and gossip in big clusters that may have a multiple of that), but writing it to a keyspace that is partitioned by node and to which each node can maintain a rangefeed is starting to look better. Still, we're unhappy with our current liveness-on-top-of-kv approach, so we may not want to keep digging that hole.

To come back to out-of-band lease transfers, they might be simpler and more contained, so there is possibly still a role for them. But as mentioned before, they don't solve the general problem. Ultimately, the cluster needs to know which replicas are having trouble to avoid pathologies.

sumeerbhola commented 2 years ago

https://github.com/cockroachdb/cockroach/issues/62199 is related to some of the discussion here

joshimhoff commented 2 years ago

Just talked with @erikgrinaker. I'm playing around with a POC of this during breather work. At this point, my goal is more learning than building something useful. Though I follow the above discussion about shedding leases, if I set that aside (we have a separate ticket for it: https://github.com/cockroachdb/cockroach/issues/77035), I have an idea of how to implement this that doesn't seem so complicated. I figure the reason it doesn't is simply because I don't understand KV that well, and one way to "fix" that is to write enough code that I understand things better. What better week to do that than breather week.

I'd stay close to what Peter suggested above: permanently trip the breaker for that replica until the node is restarted.

I don't see why we need the breaker. What I imagine doing is:

  1. If an application error will lead to a panic, before panicking, write the current applied index + 1 (the index of the entry where application failed), the range ID, & current time to a file on disk.
  2. Similar to above but for panics: Recover from panic & write the applied index + 1, the range ID, & current time to a file on disk. Then panic again.
  3. Use on disk log of application failures to implement an exponential backoff on applying some entry.

The above will reduce the rate of crashes without operator involvement. A rough sketch of steps 1 & 2 follows.
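
This is only a sketch, assuming a hypothetical applyWithRecord wrapper and persist helper; in reality this would live inside the kvserver apply path, and persist would write a store-local key rather than a file.

package applyrecord

import "time"

// failureRecord is what gets written to disk before the process dies, so the
// next startup can decide how long to back off.
type failureRecord struct {
    RangeID     int64
    FailedIndex uint64 // applied index + 1, i.e. the entry that failed to apply
    At          time.Time
}

// persist is a stand-in for durably recording the failure.
func persist(rec failureRecord) { /* write to a store-local key or a file */ }

// applyWithRecord wraps an application attempt so that an error is recorded
// before being returned (step 1) and a panic is recorded before being
// re-thrown (step 2).
func applyWithRecord(rangeID int64, appliedIndex uint64, apply func() error) error {
    rec := failureRecord{RangeID: rangeID, FailedIndex: appliedIndex + 1, At: time.Now()}
    defer func() {
        if r := recover(); r != nil {
            persist(rec)
            panic(r) // re-throw so the node still crashes, as before
        }
    }()
    if err := apply(); err != nil {
        persist(rec)
        return err
    }
    return nil
}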

erikgrinaker commented 2 years ago

Then panic again.

I'm not sure why we'd panic again? Seems like we should just recover the panic and keep the node running, with an exponential retry in the Raft scheduler. We're likely vulnerable to inconsistent data during the backoff delay anyway, so we don't gain anything by crashing the node (or maybe not because there may not be a lease but that's sort of beside the point here).

I don't see why we need the breaker.

To prevent anyone from accessing the range and reading incorrect data. For your purposes here, you can probably ignore this.

erikgrinaker commented 2 years ago

I suppose the simplest possible solution would be to keep some state on the replica about the last attempt, and then return early from Store.processReady() if that hasn't been reached yet. That's also where you'd catch the error. In this vicinity:

https://github.com/cockroachdb/cockroach/blob/98a66b59d6a23b970159743c11c24d947c757f11/pkg/kv/kvserver/store_raft.go#L505-L506

Alternatively, you'd keep track of it in the Raft scheduler around here and avoid scheduling the range again until the backoff timer expires:

https://github.com/cockroachdb/cockroach/blob/86f51efb26c709192f0aa4c839930334270ffa42/pkg/kv/kvserver/scheduler.go#L307-L309
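
As a toy sketch of that second alternative, here is a per-range gate the scheduler could consult before enqueueing a range; backoffGate and its methods are hypothetical, and the real version would presumably live inside the scheduler's existing per-range state.

package schedgate

import (
    "sync"
    "time"
)

// backoffGate remembers, per range, when the next application attempt is
// allowed.
type backoffGate struct {
    mu   sync.Mutex
    next map[int64]time.Time // rangeID -> earliest next attempt
}

func newBackoffGate() *backoffGate {
    return &backoffGate{next: make(map[int64]time.Time)}
}

// shouldSchedule reports whether the scheduler may process this range now.
// Ranges with no recorded failure (zero time) are always schedulable.
func (g *backoffGate) shouldSchedule(rangeID int64, now time.Time) bool {
    g.mu.Lock()
    defer g.mu.Unlock()
    return now.After(g.next[rangeID])
}

// reportFailure pushes the next allowed attempt out by the given delay.
func (g *backoffGate) reportFailure(rangeID int64, now time.Time, delay time.Duration) {
    g.mu.Lock()
    defer g.mu.Unlock()
    g.next[rangeID] = now.Add(delay)
}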

joshimhoff commented 2 years ago

Thanks for the pointers!!

We're likely vulnerable to inconsistent data during the backoff delay anyway, so we don't gain anything by crashing the node

It sounds like you are saying that any inconsistency issue will be eliminated once the entry for which application (at first) errored or panicked is successfully applied (that is, on retry)? As a result of ^^, there is no need to restart the server? OTOH, if we weren't confident about ^^, it would be wise to restart the server before backing off.

I was partly basing the always-panic suggestion on what @tbg said here:

I wouldn't do this as a first pass. When an error results, we might already be in an inconsistent state. I'd stay close to what Peter suggested above: permanently trip the breaker for that replica until the node is restarted.


To prevent anyone from accessing the range and reading incorrect data

Ah I see! I didn't realize the inconsistency post apply failure or panic issue could lead to inconsistent reads. I thought it was on the write side instead. What I thought was: To retry the application, it is best to restart the server first, else we risk inconsistency issues.

I think the reason I thought it was NOT on the read side is that this would suggest an apply failure or panic could lead to an inconsistent read even today: reads & applies happen concurrently, and there is obviously some time delta between the apply failure or panic happening and the server actually restarting, so you'd just need to get unlucky with the timing of the apply failure or panic & the read.

erikgrinaker commented 2 years ago

It sounds like you are saying that any inconsistency issue will be eliminated once the entry for which application (at first) errored or panicked is successfully applied (that is, on retry)? As a result of ^^, there is no need to restart the server? OTOH, if we weren't confident about ^^, it would be wise to restart the server before backing off.

You're right. Application is unlikely to succeed again on a retry, and we probably wouldn't want to recover even if application did succeed.

But I think the main motivation here is to avoid restart loops, which are very disruptive, and even with exponential backoffs we would still be in a restart loop.

I wouldn't do this as a first pass. When an error results, we might already be in an inconsistent state. I'd stay close to what Peter suggested above: permanently trip the breaker for that replica until the node is restarted.

Yeah, that's probably prudent -- if application fails, just make the replica unschedulable until someone looks at it and restarts the node.

Ah I see! I didn't realize the inconsistency post apply failure or panic issue could lead to inconsistent reads. I thought it was on the write side instead. What I thought was: To retry the application, it is best to restart the server first, else we risk inconsistency issues.

Yes, application is always on the write path, but if someone then comes along and tries to read data that was only partially updated during application it would read inconsistent data.

You're right that it's probably safer to restart before retrying, to at least avoid inconsistent state in memory. However, application is very unlikely to succeed again after the restart, so I'd argue that it's better to not restart automatically and wait for someone to fix the underlying problem and restart the node.

I think the reason I thought it was NOT on the read side is that this would suggest an apply failure or panic could lead to an inconsistent read even today: reads & applies happen concurrently, and there is obviously some time delta between the apply failure or panic happening and the server actually restarting, so you'd just need to get unlucky with the timing of the apply failure or panic & the read.

Yeah, this depends on the Go scheduler, which may schedule some other goroutine during the panic. But we don't really have any control over that.

joshimhoff commented 2 years ago

I've written a very hacky POC. I wrote the following patch to test locally:

~/g/s/g/c/cockroach [working] $ git diff pkg/kv/kvserver/replica_application_state_machine.go
diff --git a/pkg/kv/kvserver/replica_application_state_machine.go b/pkg/kv/kvserver/replica_application_state_machine.go
index 4a9fbcc522..8da73c251c 100644
--- a/pkg/kv/kvserver/replica_application_state_machine.go
+++ b/pkg/kv/kvserver/replica_application_state_machine.go
@@ -870,6 +870,10 @@ func (b *replicaAppBatch) ApplyToStateMachine(ctx context.Context) error {
                }
        }

+       if b.r.RangeID == 54 {
+               panic("josh second boom")
+       }
+
        // Apply the write batch to RockDB. Entry application is done without
        // syncing to disk. The atomicity guarantees of the batch and the fact that
        // the applied state is stored in this batch, ensure that if the batch ends

After a panic, I see a log line on startup indicating that backing off is indeed happening:

~/g/s/g/c/c/c/logs [working] $ cat cockroach.crlMBP-C02CC5FRMD6TMjYz.joshimhoff.2022-03-15T20_36_40Z.044893.log | grep W
I220315 20:36:40.339405 1 util/log/file_sync_buffer.go:238 ⋮ [config]   line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid [chan@]file:line redactionmark \[tags\] [counter] msg
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1  ALL SECURITY CONTROLS HAVE BEEN DISABLED!
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1 +
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1 +This mode is intended for non-production testing only.
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1 +
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1 +In this mode:
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1 +- Your cluster is open to any client that can access ‹any of your IP addresses›.
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1 +- Intruders with access to your machine or network can observe client-server traffic.
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1 +- Intruders can log in without password and read or write any data in the cluster.
W220315 20:36:40.339152 1 1@cli/start.go:1146 ⋮ [n?] 1 +- Intruders can consume all your server's resources and cause unavailability.
W220315 20:36:40.339846 1 1@cli/start.go:1072 ⋮ [n?] 3  ‹Using the default setting for --cache (128 MiB).›
W220315 20:36:40.339846 1 1@cli/start.go:1072 ⋮ [n?] 3 +‹  A significantly larger value is usually needed for good performance.›
W220315 20:36:40.339846 1 1@cli/start.go:1072 ⋮ [n?] 3 +‹  If you have a dedicated server a reasonable setting is --cache=.25 (8.0 GiB).›
W220315 20:36:40.498266 48 server/status/runtime.go:325 ⋮ [n?] 15  could not parse build timestamp: ‹parsing time "" as "2006/01/02 15:04:05": cannot parse "" as "2006"›
W220315 20:36:41.246871 48 1@gossip/gossip.go:1472 ⋮ [n1] 20  no addresses found; use --join to specify a connected node
W220315 20:36:41.273064 48 server/server.go:2676 ⋮ [n1] 31  failed to start query profiler worker: failed to read ‹memory› cgroup from cgroups file: ‹/proc/self/cgroup›: open ‹/proc/self/cgroup›: no such file or directory
W220315 20:36:51.273720 403 1@server/status/runtime.go:468 ⋮ [n1] 56  unable to get file descriptor usage (will not try again): ‹not implemented on darwin›
W220315 20:37:00.263205 366 kv/kvserver/replica_raft.go:1011 ⋮ [n1,s1,r54/1:‹/{Table/54-Max}›,raft] 57  backing off for 5m0s because recent apply panic

The POC is here: https://github.com/cockroachdb/cockroach/pull/77858. The POC catches a panic in apply, records info about the panic needed to implement a backoff using the storage engine, and then re-throws the panic. The POC also backs off at startup time in case of a recent panic with the right current applied index. A new store local key is introduced for recording info about panics.

This POC doesn't trip the breaker. I think I understand why we need to do that now: IIUC, the code I added introduces additional time between panic and server crash (the time waiting on storage engine to record info about the panic), and we don't want reads to be served then.

This POC doesn't deal with shedding leases. To me, that is fine, as we have https://github.com/cockroachdb/cockroach/issues/77035 for that, and also even without that, an exponential backoff on application seems like an improvement over the status quo to me.

There are some other TODOs in the code (e.g. the POC does a linear backoff only right now). I am sure I am missing rather obvious things, but I thought it'd be fun to write a POC over breather work.

However, application is very unlikely to succeed again after the restart, so I'd argue that it's better to not restart automatically and wait for someone to fix the underlying problem and restart the node.

I hear you on this, but I still wonder if exponential backoff is a nice thing to do here. It eliminates the need to make a judgement call about the likelihood of an apply retry succeeding post server restart. If it turns out that there are some cases where a retry will actually succeed without operator involvement (e.g. temporary issues with infra like disks), that will happen quickly, because early in the backoff, the backoff time is very short. If it turns out that we do need an operator, the backoff time will quickly grow very large, large enough that the impact of the crashes is quite small (e.g. one crash per 10m... even one crash per 30m). The fact that one policy handles both these cases gracefully is one of the reasons I love a good ol' exponential backoff.

There is also some precedent in the CRDB job system for following this pattern: https://github.com/cockroachdb/cockroach/issues/44594

erikgrinaker commented 2 years ago

I hear you on this, but I still wonder if exponential backoff is a nice thing to do here. It eliminates the need to make a judgement call about the likelihood of an apply retry succeeding post server restart. If it turns out that there are some cases where a retry will actually succeed without operator involvement (e.g. temporary issues with infra like disks), that will happen quickly, because early in the backoff, the backoff time is very short. If it turns out that we do need an operator, the backoff time will quickly grow very large, large enough that the impact of the crashes is quite small (e.g. one crash per 10m... even one crash per 30m). The fact that one policy handles both these cases gracefully is one of the reasons I love a good ol' exponential backoff.

If the goal here is to limit the blast radius of application errors, I think we need to prevent a range problem from becoming a node problem. A node crash is very disruptive: in the best case, it will cause up to 10 seconds of unavailability for leases on that node and abort all transactions that it's the gateway for. The failure is also likely to happen across all replicas, so it would crash multiple nodes, potentially taking out the whole cluster. That disruption does not seem worth it on the off-chance that a restart fixes the problem -- application is designed to be deterministic, so a failure is also likely to be deterministic.

You do bring up a good point about e.g. disk failures and such that might be temporary, limited to a single component, and could resolve themselves. There are multiple failure modes here that we should try to handle. I think what we should do is:

  1. Cordon off the replica on application failures (when it is safe to do so), preventing all access to it, but leave the node running. This also needs to shed the lease, ideally passively. If the failure did turn out to be a temporary blip on a single replica, then when we cordon off the replica the range should eventually detect this, remove the replica from the range, and upreplicate elsewhere -- this would automatically handle these random, recoverable failures. (A rough sketch of the cordoning piece follows after this list.)

  2. Have a higher-level failure detector that detects correlated failures (e.g. application failures across many ranges on a single node/DC or application failures across all replicas of a single range) and takes appropriate action. There's been some talk about building improved failure detection in general, e.g. better node/replica liveness, detecting disk failures, flakiness and latency fluctuations, etc -- this should probably be integrated with that.
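
To make item 1 a bit more concrete, here is a toy sketch of a per-store cordon registry consulted on the request path; this is only an illustration of the idea, and the real mechanism would presumably build on the replica circuit breaker and the Raft scheduler rather than a bare set.

package cordon

import (
    "fmt"
    "sync"
)

// registry tracks replicas that failed application and must not serve traffic
// or be scheduled until an operator (or a recovery mechanism) un-cordons them.
type registry struct {
    mu       sync.Mutex
    cordoned map[int64]bool // rangeID -> cordoned
}

func newRegistry() *registry {
    return &registry{cordoned: make(map[int64]bool)}
}

func (r *registry) cordon(rangeID int64) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.cordoned[rangeID] = true
}

// checkRequest is consulted on the read/write path; cordoned ranges fail fast
// rather than risk exposing partially applied state.
func (r *registry) checkRequest(rangeID int64) error {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.cordoned[rangeID] {
        return fmt.Errorf("r%d is cordoned after an application failure", rangeID)
    }
    return nil
}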

joshimhoff commented 2 years ago

Two meta Qs:

  1. Want to 1:1 about this? Or is there another forum for design discussions?
  2. I think I will keep playing with the POC, as through it & this discussion I am learning stuff during breather week, which is my main goal. I entirely understand that you favor a different approach, and so the POC I'm writing might not be the right thing in the end.

That disruption does not seem worth it on the off-chance that a restart fixes the problem -- application is designed to be deterministic, so a failure is also likely to be deterministic.

I know you already said this, but thanks for saying it again, as I am used to spaces where retries do have a good shot at fixing the issue!!

I think the real reason that I like the exponential backoff approach is that with it there is no need to handle the multiple failure modes you list above separately. A relatively simple back-off-if-recent-panic component running locally on each node leads to a reduction in blast radius for the cluster as a whole. Even if the chances of a retry fixing the issue are very low, if it is not zero, we do want to handle that case, and to me at least it's nice that the solution I am pushing for handles both that case and the more common repeated failure on application, with a relatively low amount of complexity (assuming that the POC I wrote is on the right track (I ack that it might not be lol lol)).

I hear you pointing out that panics are disruptive and so questioning whether we are getting the reduction in blast radius we want. Here's an idea:

  1. Cap the exponential backoff on a certain number of crashes + retries. Once that limit is hit, no more crashes without operator involvement.

I wonder if you will not like this because the initial crashes will still cause node wide impact? I'd argue this difference in POV is a values thing as opposed to a purely technical issue. That <= 3 nodes will experience a bounded number of crashes in case of a rare event like application panic / failure doesn't bother me. To me, the blast radius reduction of what I suggest compared to the status quo is quite significant, and shooting for ~zero crashes isn't motivating.

If I try to be more objective about that feeling, what I'd say is that CRDB will be reliable enough for our customers even if a node experiences a bounded number of crashes in case of a rare event like application failure. I am particularly motivated to not ever have extended multi-tenant outages in serverless (a goal I really believe in), and a bounded number of crashes means the outage will affect a single-tenant only (assuming affected range is in tenant key space) once the crash limit is hit.

erikgrinaker commented 2 years ago

Want to 1:1 about this?

Sure, ping me when you get in tomorrow.

there is no need to handle the multiple failure modes you list above separately

Cordoning alone will handle both of these failure modes better: a temporary blip on a single replica gets the cordoned replica removed from the range and upreplicated elsewhere, while a deterministic failure stays contained to the cordoned replica until someone fixes the underlying problem -- in neither case do we crash nodes.

It isn't clear to me why we would choose to restart instead.

The failure detector idea is to try and improve things beyond these measures, e.g. with a borked range.

I wonder if you will not like this because the initial crashes will still cause node wide impact? I'd argue this difference in POV is a values thing as opposed to a purely technical issue. That <= 3 nodes will experience a bounded number of crashes in case of a rare event like application panic / failure doesn't bother me. To me, the blast radius reduction of what I suggest compared to the status quo is quite significant, and shooting for ~zero crashes isn't motivating.

My question would be what do the restarts buy us. It doesn't seem like they really buy us anything, so why do it?

I am particularly motivated to not ever have extended multi-tenant outages in serverless (a goal I really believe in), and a bounded number of crashes means the outage will affect a single-tenant only (assuming affected range is in tenant key space) once the crash limit is hit.

I don't think this is true? If a range has a deterministic application failure (which is what happened in the outage that motivated this), then all replicas of that range will cause node crashes at the same time. If 3 nodes go down, presumably all other tenants that have >=2 range replicas on those nodes will also be affected?

joshimhoff commented 2 years ago

I don't think this is true? If a range has a deterministic application failure (which is what happened in the outage that motivated this), then all replicas of that range will cause node crashes at the same time. If 3 nodes go down, presumably all other tenants that have >=2 range replicas on those nodes will also be affected?

Yes, while the nodes are crashing, there will be impact on other tenants. Based on your feedback, though, I was imagining a world where there is a limit to the number of crashes enforced in the backoff code (the bullet up above that starts with "cap the exponential backoff on a certain number of crashes + retries"). Once that limit is hit, no more crashes. In that world, the outage will stop affecting multiple tenants eventually, without operator involvement.

But anyway thanks for laying out the reasons for preferring cordoning so clearly!!! Looking forward to chatting more soon.

tbg commented 2 years ago

Just a heads up that I'm following the discussion. Don't have much to add to Erik's points. It would be nice to avoid the need for a per-Replica lease transfer at least for a POC. Perhaps we can achieve that by gossiping (up to a certain number) the defunct replica descriptors (rangeid:replicaid:leaseseq pairs), and have the invariant that if a pair is gossiped as defunct, they're no longer using their lease (i.e. it can be acquired even if node is still live). The same signal could be used by the allocator to remove the offending replica from the range, at which point it ceases to exist (pending replicaGC, which we should try to make tolerant to errors).
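
For illustration, a hypothetical sketch of the gossiped payload: a bounded list of (rangeID, replicaID, leaseSeq) triples, with the invariant that any replica listed here has stopped using its lease. None of these types exist in the codebase.

package defunctgossip

// maxDefunct bounds the payload size so the gossiped value stays small even
// in large clusters.
const maxDefunct = 1000

type defunctReplica struct {
    RangeID   int64
    ReplicaID int32
    LeaseSeq  int64 // lease the replica promises it is no longer using
}

type defunctList struct {
    Replicas []defunctReplica
}

// add appends a defunct replica, dropping the oldest entry if the bound is hit.
func (l *defunctList) add(r defunctReplica) {
    if len(l.Replicas) >= maxDefunct {
        l.Replicas = l.Replicas[1:]
    }
    l.Replicas = append(l.Replicas, r)
}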

joshimhoff commented 2 years ago

@erikgrinaker and I just chatted!

I'm gonna POC:

  1. If apply error or panic, crash once (will need to write to storage or similar to implement crash once (something similar to https://github.com/cockroachdb/cockroach/pull/77858 would work)).
  2. On startup after the crash, "cordon" the range. Concretely, this means the raft scheduler doesn't schedule the replica. Note that this is a new concept, as the 22.1 circuit breaker fails writes but doesn't de-schedule the replica at the raft scheduler level.

One restart is not a big deal at a reliability level IMHO. We also think it sidesteps the need to actively shed the lease (see the above stuff from @tbg about gossip (can we avoid using gossip??)) or implement replica liveness (https://github.com/cockroachdb/cockroach/issues/77035), which is a big (& exciting) project. It may also be that cordoning on startup is simpler than cordoning post startup (e.g. no need to cancel anything).

We also think it sidesteps the need to actively shed the lease

This does confuse me a little bit though now that I am writing the idea down: Does a single restart def shed the lease? That is, if KV starts up again so fast that no liveness heartbeat is missed, could the lease remain on the restarted node?

If the answer to ^^ is "one restart may not shed the lease", enough restarts would, which suggests that an exponential backoff of sorts may be a way to improve the status quo (even though the restarts do cause impact) without biting off https://github.com/cockroachdb/cockroach/issues/77035. Could even restart until the lease is shed (doesn't sound elegant).

erikgrinaker commented 2 years ago

To add a bit more context of how I see this playing out:

Long term, the ideal solution would be to cordon the replica and have it passively lose its lease and get removed from the range, by some form of replica-level heartbeats/liveness. Cordoning involves blocking all read/write access to the range (including cancelling in-flight requests) and preventing it from being scheduled -- probably using a new type of circuit breaker (#77366).

Medium term, gossiping the faulty replicas might be a cheaper way to shed the lease and get removed from the range. We'll still need the same cordoning mechanism.

Short-term, a single restart and then cordoning the replica would avoid the lease shedding issue and cancellation of in-flight requests. It would still need to cordon the replica, and ideally get it removed from the range (but the latter may be more work than it's worth, at least for a PoC). There also needs to be a way to un-cordon the replica when the problem has been resolved.

We'll discuss how much of this we want to address for the 22.2 cycle.

This does confuse me a little bit though now that I am writing the idea down: Does a single restart def shed the lease? That is, if KV starts up again so fast that no liveness heartbeat is missed, could the lease remain on the restarted node?

No, I think this will give the node a new liveness epoch, so it won't be able to resume any of its existing leases.

joshimhoff commented 2 years ago

No, I think this will give the node a new liveness epoch, so it won't be able to resume any of its existing leases.

Ah got it!

joshimhoff commented 2 years ago

OK, here is a new POC, @erikgrinaker & @tbg: https://github.com/cockroachdb/cockroach/pull/78092

I haven't tested on a multi-node cluster, but I wrote the following integration test to test the behavior on a single node cluster. Was very satisfying. I love the CRDB test infra!!!

func TestCordonIfPanicDuringApply(t *testing.T) {
    defer leaktest.AfterTest(t)()
    defer log.Scope(t).Close(t)

    skip.UnderShort(t)

    ctx := context.Background()

    dir, cleanupFn := testutils.TempDir(t)
    defer cleanupFn()

    testArgs := func(dontPanicOnApplyPanicOrFatalError bool) base.TestServerArgs {
        return base.TestServerArgs{
            Settings: cluster.MakeClusterSettings(),
            // We will start up a second server after stopping the first one to
            // simulate a server restart, since cordoning in response to a panic
            // during apply only takes effect after a server restart. So we use
            // the same store for the two servers (they are the same node). This
            // way, the should cordon write to storage done by the first server is
            // seen by the second server at startup.
            StoreSpecs: []base.StoreSpec{{Path: dir + "/store-1"}},
            Knobs: base.TestingKnobs{
                Store: &kvserver.StoreTestingKnobs{
                    DontPanicOnApplyPanicOrFatalError: dontPanicOnApplyPanicOrFatalError,
                    // Simulate a panic during apply!
                    TestingApplyFilter: func(args kvserverbase.ApplyFilterArgs) (int, *roachpb.Error) {
                        for _, ru := range args.Req.Requests {
                            key := ru.GetInner().Header().Key
                            // The time-series range is continuously written to.
                            if bytes.HasPrefix(key, keys.TimeseriesPrefix) {
                                panic("boom")
                            }
                        }
                        return 0, nil
                    },
                },
            },
        }
    }

    s, _, kvDB := serverutils.StartServer(t,
        // In production, the first time the panic at apply time is experienced, we expect the
        // server to mark the replica as to be cordoned and crash after re-throwing the panic.
        // In this test, we expect the server to mark the replicas as to be cordoned but also
        // to NOT re-throw the panic. Else the test would fail due to a uncaught panic.
        testArgs(true /* dontPanicOnApplyPanicOrFatalError */))

    // TODO(josh): This is gnarly. Can we force the scheduler to run on the range with the
    // apply time panic, instead of sleeping here?
    time.Sleep(10 * time.Second)
    s.Stopper().Stop(ctx)

    s, _, kvDB = serverutils.StartServer(t,
        // On the second run of the server, we don't expect a panic, as on startup, the replica
        // should be cordoned. All raft machinery should stop, so the TestingApplyFilter up above
        // shouldn't run.
        testArgs(false /* dontPanicOnApplyPanicOrFatalError */))
    defer s.Stopper().Stop(ctx)

    time.Sleep(10 * time.Second)

    // On the second run of the server, we expect requests to the cordoned range to fail fast.
    //
    // Note that if this was a three node cluster, and if only one replica was failing to apply
    // some entry, we'd expect that one replica to cordon on restart, which would imply shedding
    // the lease. As a result, we'd expect reads & writes to the range to succeed, rather than
    // fail fast.
    // TODO(josh): Test behavior on a three node cluster.
    _, err := kvDB.Get(ctx, keys.TimeseriesPrefix)
    require.Error(t, err)
    require.Regexp(t, "cordoned", err.Error())
}
tbg commented 2 years ago

We also discussed cordoning in this week's KV/Repl Eng meeting.

erikgrinaker commented 9 months ago

Adding O-support here, since this was a major factor in a large outage.