Open · kalman5 opened 4 years ago
Hello, I am Blathers. I am here to help you get the issue triaged.
It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.
I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
:owl: Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
cc @RichardJCai
Do you happen to have the logs up until the error is given?
I can repeat the experiment whenever you want; you can find me on Slack as "Gaetano". What bothers me is the entire cluster becoming unavailable.
This is what I see in the log of the isolated node (it can still reach the 2 nodes in the same datacenter).
I'm not able to get an internal error while connecting to the cluster; however, a simple SELECT count(*) FROM test; on my table hangs forever until I remove the firewall rule.
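As an aside, here is a minimal sketch of how the hang can be observed from a Go client (the lib/pq driver, connection string, and table name are assumptions, not from the report): with a context deadline the query fails after 10s instead of blocking until the firewall rule is removed.

```go
// Sketch: bound the hanging query with a client-side deadline.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	// Point this at a node you can still reach.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	var n int
	// Hangs while the range's leaseholder is unreachable; the deadline turns
	// the hang into a visible "context deadline exceeded" error after 10s.
	if err := db.QueryRowContext(ctx, "SELECT count(*) FROM test").Scan(&n); err != nil {
		log.Fatalf("query failed: %v", err)
	}
	log.Printf("count = %d", n)
}
```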
I'm able to reproduce this.
I'm able to reproduce this as well and this seems like a real issue. The problem occurs because at least one range in the table you are querying, test2, has a leaseholder on the partially partitioned node (let's call it P). The gateway you are querying is unable to reach this leaseholder on P, so the query hangs.
The question is - why isn't this leaseholder moving to a different node that is not partially partitioned? At first, I thought this was due to the quorum size of 6. Typically, even-numbered quorums are fragile because they can result in voting ties during leader elections. If all nodes in DC A voted for P and all nodes in DC B voted for one of the other nodes, the election would result in a draw and would need to be run again. However, I tested this out on a 7 node cluster and saw the same thing. That's because we aren't even calling an election for the inaccessible range on P.
The problem here seems to be that we delegate liveness tracking to a separate range in the system ("node liveness"). This NodeLiveness range may have a different set of replicas than the table range. More importantly, it may have a different leaseholder. It turns out that there are two cases we can get into here:
In the case where the node liveness leaseholder is in DC B, the partitioned node P is unable to heartbeat its liveness record. Its lease on the table range then effectively expires, which allows one of the other nodes to call an election and then acquire the lease. The whole thing recovers in about 15s and then we're back to normal.
However, in the case where the node liveness leaseholder is in DC A, things are more tricky. P is still able to heartbeat its liveness record, so it keeps effectively extending its lease over the table range. The NodeLiveness leaseholder doesn't know that P is partitioned, so it happily allows it to continue heartbeating its record. Meanwhile, the data range itself is quiesced and no one is trying to write to it, so none of the replicas are expecting Raft heartbeats. So the leaseholder just stays put and the query hangs while trying to connect to P.
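To make the two-level hierarchy concrete, here is a grossly simplified sketch (the types are invented for illustration, not CockroachDB's actual ones) of why P's epoch-based lease never lapses: validity only checks the liveness record's epoch and expiration, never whether the holder can reach the data range's other replicas.

```go
// Simplified model of an epoch-based lease backed by a node liveness record.
package main

import (
	"fmt"
	"time"
)

type LivenessRecord struct {
	Epoch      int64
	Expiration time.Time // pushed forward by each successful heartbeat
}

type EpochLease struct {
	Holder string
	Epoch  int64 // the holder's liveness epoch at lease acquisition
}

// Valid returns true while the holder's liveness record is unexpired and
// still at the lease's epoch. Note there is no check that the holder can
// actually reach the other replicas of the data range.
func (l EpochLease) Valid(liveness LivenessRecord, now time.Time) bool {
	return liveness.Epoch == l.Epoch && now.Before(liveness.Expiration)
}

func main() {
	now := time.Now()
	// P keeps heartbeating the liveness range in DC A, so its record is fresh.
	pLiveness := LivenessRecord{Epoch: 7, Expiration: now.Add(9 * time.Second)}
	lease := EpochLease{Holder: "P", Epoch: 7}
	fmt.Println(lease.Valid(pLiveness, now)) // true: the lease stays put, the query hangs
}
```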
I'm not actually sure how this class of partial failures is supposed to be handled. It's easy enough to understand in standard Raft, but the two-level liveness hierarchy we have makes it harder to reason about. @bdarnell do you know what we expect to happen here? Should we be restricting the conditions under which we allow a node to heartbeat its liveness record - similar to what we did with nodes that are unable to fsync? We currently use node-liveness as a proxy to determine suitability to hold leases, but it doesn't do a good enough job recognizing partial failures like this, and I'm not sure that it can.
Note that I don't think we have this kind of issue with ranges that use standard expiration-based leases.
Unfortunately we can't handle this like we did with nodes that can't fsync by forbidding liveness heartbeats from nodes that are partitioned. What exactly would make a node consider itself partitioned? Even a single unreachable peer could cause hanging queries if that peer happened to be the gateway. In this case node P can only reach a minority of peers, which seems on its face to be a reasonable threshold, but I think if we require a node to have a majority of its peers live in order to come back online we'll probably have trouble recovering from outages.
Solving network partitions by detecting the partition is a losing game - a sufficiently complex partition can always defy your efforts to figure out what's going on (and in the limit, we're OK with that - we're a CAP-consistent database, not a CAP-available one, and there will be situations in which we're unable to remain available. We just want to handle "reasonable" classes of partitions, for some definition of "reasonable").
We used to have a general mechanism for this, the SendNextTimeout in DistSender. This would cause DistSender to try a different node if the one that it thought was the leaseholder was failing to respond. This was removed in #16088 and some follow-on commits because (IIRC) the additional retries were leading to cascading failures in overloaded clusters, and it was redundant with GRPC-level heartbeats.
Assuming those GRPC-level heartbeats are working, the gateway node should be periodically dropping its connection to P and trying on another replica. But as long as P is still live, it'll just get redirected back to P. We should be using these requests as a trigger to unquiesce the range, allow the other replicas of the range to detect P's failure, and then forcibly invalidate its liveness record.
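A hypothetical sketch of that trigger (none of these names are real CockroachDB APIs): treat a request that keeps failing to reach the cached leaseholder as a hint to unquiesce, let raft detect the missing leader, and only then invalidate its liveness record.

```go
// Sketch of using redirected/failed requests to wake a quiesced range.
package main

import "fmt"

type rangeReplica struct {
	quiesced bool
}

// onRedirectedRequest would be invoked when requests keep bouncing back to an
// unreachable leaseholder via NotLeaseHolderError redirects.
func (r *rangeReplica) onRedirectedRequest(leaseholderReachable bool) {
	if leaseholderReachable {
		return
	}
	if r.quiesced {
		r.quiesced = false // wake the raft group so heartbeats (and elections) resume
	}
	// Once the group is awake, the remaining replicas can notice the missing
	// leader and increment the stuck leaseholder's liveness epoch to
	// invalidate its lease.
	// incrementLeaseholderEpoch() // hypothetical follow-up step
}

func main() {
	r := &rangeReplica{quiesced: true}
	r.onRedirectedRequest(false)
	fmt.Println("quiesced:", r.quiesced) // quiesced: false
}
```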
To be honest, when I did this experiment I just expected CRDB to form a mesh network such that P would reach nodes in the other datacenter via the non-partitioned nodes. A mesh network would be a nice abstraction in CRDB, solving this kind of problem by construction.
CC @tbg
Unfortunately we can't handle this like we did with nodes that can't fsync by forbidding liveness heartbeats from nodes that are partitioned. What exactly would make a node consider itself partitioned? Even a single unreachable peer could cause hanging queries if that peer happened to be the gateway. In this case node P can only reach a minority of peers, which seems on its face to be a reasonable threshold, but I think if we require a node to have a majority of its peers live in order to come back online we'll probably have trouble recovering from outages.
We have an option in between a node not heartbeating itself and the node continuing to hold its leases: the node in question can increment its own epoch, thereby relinquishing all its leases. But the twist is that it'd immediately take back all its leases, except for the range(s) it believes to have an inaccessible quorum. I mean, the node would attempt to take back all the leases, but it wouldn't succeed for the partitioned ranges.
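A hypothetical sketch of that option (made-up names, not real APIs): bump the node's own epoch to drop every lease at once, then let the per-range lease reacquisition fail naturally wherever the node cannot reach a quorum.

```go
// Sketch: shed all leases by self-incrementing the epoch, then reacquire.
package main

import "fmt"

type rangeLease struct {
	rangeID         int64
	quorumReachable func() bool // from this node's point of view
}

type node struct {
	epoch  int64
	leases []rangeLease
}

func (n *node) shedAndReacquireLeases() {
	n.epoch++ // invalidates every lease tied to the old epoch
	kept := n.leases[:0]
	for _, l := range n.leases {
		// Re-acquiring a lease requires committing a lease request through
		// raft, which only succeeds for ranges where this node can still
		// reach a quorum.
		if l.quorumReachable() {
			kept = append(kept, l)
		}
	}
	n.leases = kept
}

func main() {
	n := &node{epoch: 7, leases: []rangeLease{
		{rangeID: 6, quorumReachable: func() bool { return false }}, // the partitioned range
		{rangeID: 9, quorumReachable: func() bool { return true }},
	}}
	n.shedAndReacquireLeases()
	fmt.Println(n.epoch, len(n.leases)) // 8 1: only r9's lease is retained
}
```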
Hi guys, is there an update on this? This is impacting another customer (please see zendesk 8203), I'm not putting their name here as this is a public repo.
No update. I think it's unlikely that there's going to be any movement here soon, cause it seems hard. However, note that this issue is about partial partitions. Zendesk 8203 seems to me to be about a full partition; with full partitions things are supposed to be more clear.
This has affected another customer - please see zendesk 10495.
Zendesk ticket #10484 has been linked to this issue.
Zendesk ticket #10495 has been linked to this issue.
This is a lot less useful without a specific repro, but I saw a somewhat similar outage caused by an asymmetric partition recently that had me wondering why no other node took the affected lease from the partitioned node. It's interestingly different in that the partially partitioned nodes did lose their node liveness but one or more range leases got stuck on them anyway.
Concise-ish series of events:
- failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s: context deadline exceeded, and all of the healthy nodes in the cluster reported a liveness_livenodes metric that was decreased by exactly the number of nodes suffering from connectivity issues
- have been waiting 1m0s attempting to acquire lease for r6
- RangeFeed closed timestamp 1638974354.518526211,0 is behind by 1m9.554085818s
- slow range RPC: have been waiting 62.26s (1 attempts) for RPC [txn: 4dfaefe9], [can-forward-ts], Get [/Table/3/1/11/2/1,/Min) to r6:/Table/{SystemConfigSpan/Start-11}. resp: failed to send RPC: sending to all replicas failed; last error: unable to dial n39: breaker open (where n39 is the node in AZ x that previously held r6's lease)
- NotLeaseHolderError -- e.g. resp: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] r6: replica (n43,s43):1482 not lease holder; current lease is repl=(n39,s39):1479 seq=0 start=0,0 exp=<nil>
My question here is why did no other replica attempt to take the lease? The best guess I can manage with my hazy memory of how things work is that the broken node in AZ x was holding onto raft leadership despite having lost its liveness and that that was somehow preventing other processes from taking the lease on the range. But I'm curious if there's something else going on or if any progress has been made on these sorts of situations since I last knew.
In case it matters, this was on v20.2.
Hi Alex,
that's a good question. r6 has an epoch-based lease, and from what you're describing it looks as though the r6 leaseholder in AZ x couldn't have been maintaining liveness throughout the incident. I was wondering whether https://github.com/cockroachdb/cockroach/blob/e17b557ae254f0536e5bada69e60dd3ff4ff661b/pkg/kv/kvserver/replica_proposal_buf.go#L452-L461 (introduced in #55148) could have to do with it, but that doesn't appear to be present in 20.2 yet. On the nodes that should've obtained the lease, were you seeing any "have been waiting [...] attempting to acquire lease" messages? I assume you're not at liberty to share the logs privately, but if you are able to, I could take a look (maybe `./cockroach debug merge-logs --filter 'r6[^0-9]'` and some more search-replace of keys can do something? Or maybe you have redactable logs, though I'm not even sure that's supported in 20.2).
As for the scenario, if we wanted to reproduce this, would the following seem to represent your situation reasonably well:
Thanks for the response, Tobi!
(introduced in #55148) could have to do with it but that doesn't appear to be present in 20.2 yet.
Sorry, I should have been more specific -- this was on v20.2.13, which does appear to contain that change. So maybe?
On the nodes that should've obtained the lease, were you seeing any "have been waiting [...] attempting to acquire lease" messages?
No. It was only ever logged by the unhealthy node (for r6, that is).
if you are able to, I could take a look
Sure, I'll redact anything needed from the logs and send them over to your email.
As for the scenario, if we wanted to reproduce this, would the following seem to represent your situation reasonably well:
Yeah, although it's hard to say whether there weren't any additional factors at play. If it helps, the p90/p99 latency as self-reported by the round_trip_latency histogram metric was just over 6 seconds. The p50s weren't terrible; it was really only around p80-85 and above where reported latencies from the unhealthy nodes skyrocketed.
I took a look and yes, this is starting to look as though other replicas may have denied themselves the lease based on #55148. I say this because in the logs, we see these messages:
resp: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] r6: replica (n443,s443):1482 not lease holder; current lease is repl=(n396,s396):1479 seq=0 start=0,0 exp=
This is notably a partially populated lease (start and exp are missing), and this is precisely what you get if you hit this code, which calls down to construct a NotLeaseholderError here. Note the speculativeLease, which has only the Replica but nothing else.
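As a small illustration (simplified stand-in types, not the real roachpb ones), a lease constructed with only the replica filled in prints exactly like the log line above, because sequence, start, and expiration are all zero values:

```go
// Sketch: a "speculative" lease with only the replica populated.
package main

import "fmt"

type lease struct {
	replica string
	seq     int64
	start   float64  // stand-in for the wall-clock start timestamp
	exp     *float64 // nil means no expiration was ever set
}

func main() {
	speculative := lease{replica: "(n39,s39):1479"} // only the replica is known
	fmt.Printf("repl=%s seq=%d start=%v,0 exp=%v\n",
		speculative.replica, speculative.seq, speculative.start, speculative.exp)
	// Output: repl=(n39,s39):1479 seq=0 start=0,0 exp=<nil>
}
```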
Thanks for digging in and confirming. I agree that that looks quite likely.
It seems as though this isn't a situation for which there's a trivial fix, but hopefully knowing about it helps inspire some improvements in the future. I'm happy to chat and bounce ideas back and forth anytime, but I also doubt my ideas will be as helpful or relevant as they may have been in the past :)
Note that I don't think we have this kind of issue with ranges that use standard expiration-based leases.
Based on a discussion with @tbg about the CheckQuorum option being unset on master (https://github.com/cockroachlabs/support/issues/1520#issuecomment-1092105429), I think it's possible our raft groups will NOT survive certain one way network partitions (where the raft leader can send heartbeats but not receive any network traffic). As a result, IIUC, ranges with expiration-based leases also will become unavailable in the face of certain one way network partitions. I've tried to show this is true with https://github.com/cockroachdb/cockroach/pull/79699. It's VERY possible I've not simulated what I meant to simulate, given my lack of experience with KV internals! Still, I think I see different and "better" behavior with the CheckQuorum option set (the raft leader on the partitioned node steps down to follower according to the logs). So maybe I got the repro right enough??? See https://github.com/cockroachdb/cockroach/pull/79699 for more.
Let's assume that with CheckQuorum enabled, our raft groups DO survive one way network partitions that allow the leader to send on the network but not receive. To me, this suggests a way to resolve this ticket here: piggy-back on the raft group's ability to survive these network conditions. Has this idea already been discussed? Sort of follows from what @nvanbenschoten said earlier in this ticket:
It's easy enough to understand in standard Raft
Going back to the beginning of this ticket:
However, in the case where the node liveness leaseholder is in DC A, things are more tricky. P is still able to heartbeat its liveness record, so it keeps effectively extending its lease over the table range. The NodeLiveness leaseholder doesn't know that P is partitioned, so it happily allows it to continue heartbeating its record. Meanwhile, the data range itself is quiesced and no one is trying to write to it, so none of the replicas are expecting Raft heartbeats. So the leaseholder just stays put and the query hangs while trying to connect to P.
With CheckQuorum set, P will step down as raft leader, leaving a log line like this:
4/logs/cockroach.crlMBP-C02CC5FRMD6TMjYz.joshimhoff.2022-04-08T20_18_07Z.094978.log:1044:W220408 20:21:22.520510 346 vendor/go.etcd.io/etcd/raft/v3/raft.go:996 ⋮ [n4,s4,r2/2:‹/System/NodeLiveness{-Max}›] 975 2 stepped down to follower since quorum is not active
We would then do something like this:
When a replica's raft group steps down to follower due to a non-active quorum, the replica should also expire and/or transfer its range lease (assuming it has the range lease).
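A hypothetical sketch of that hook (made-up names, not an existing CockroachDB code path): on the step-down-due-to-inactive-quorum transition, drop the lease rather than letting node liveness keep it alive.

```go
// Sketch: relinquish the range lease when raft steps down for lack of quorum.
package main

import "fmt"

type replica struct {
	rangeID    int64
	holdsLease bool
}

// onRaftStepDown would be wired to the point where we observe raft's
// "stepped down to follower since quorum is not active" transition.
func (r *replica) onRaftStepDown(quorumActive bool) {
	if quorumActive || !r.holdsLease {
		return
	}
	r.holdsLease = false // expire or transfer the lease rather than keep renewing it
	fmt.Printf("r%d: relinquished lease after losing an active quorum\n", r.rangeID)
}

func main() {
	r := &replica{rangeID: 2, holdsLease: true}
	r.onRaftStepDown(false)
}
```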
It's a little unclear to me if ^^ is sufficient. Could the partitioned node re-grab the range lease before being un-partitioned? It can't take raft leadership since it is receiving no network traffic. But I don't understand range leases concretely enough to answer my own Q. @andreimatei says some stuff that sort of sounds related here: https://github.com/cockroachdb/cockroach/issues/49220#issuecomment-756444001
IIUC, one (major?) issue with this suggestion is, since heartbeating is needed to step down after losing quorum:
Meanwhile, the data range itself is quiesced and no-one is trying to write to it, so none of the replicas are expecting Raft heartbeats.
Perhaps every N minutes, a quiesced range should be woken up? We already do this in CC land, since we are KV probing all ranges, regardless of whether they are quiesced. Then the outage would be resolved in at most ~N minutes.
I ack that ^^ is not particularly satisfying...
I am more trying to sketch the idea of "piggy-back on the raft group's ability to survive certain one way partitions" than propose a concrete approach. Does the idea have any merit to it?
Apologies if there are major misunderstandings in what I've written and/or in the linked "repro": https://github.com/cockroachdb/cockroach/pull/79699!!
Based on a discussion with @tbg about the CheckQuorum option being unset on master
The reason CheckQuorum is disabled is that it requires idle leader ranges to track the passage of time, thus raising the cost of an idle range and reducing our data capacity per node. (and because the most common problem solved by CheckQuorum is also solved by PreVote, in a way that is less expensive to us. But that does leave gaps in our coverage for asymmetric network partitions). Simply setting CheckQuorum to true may not have the expected results due to replica quiescence (back before PreVote, we used CheckQuorum with a bunch of additional hacks to make it work, but all that code is gone).
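For reference, this is how those knobs look at the etcd/raft level; a standalone sketch using go.etcd.io/etcd/raft/v3 directly (node ID, tick values, and the single-peer bootstrap are arbitrary, and this is not how CockroachDB wires up raft). The cost Ben mentions comes from the fact that CheckQuorum only does anything if the leader keeps ticking while otherwise idle.

```go
// Sketch: enabling CheckQuorum (and PreVote) on a raw etcd/raft node.
package main

import "go.etcd.io/etcd/raft/v3"

func main() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
		PreVote:         true, // cheaper protection against disruptive rejoining nodes
		CheckQuorum:     true, // leader steps down if it stops hearing from a quorum
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()
	// A real user would now drive n.Tick() from a timer and process n.Ready();
	// CheckQuorum only has an effect if those ticks keep happening, which is
	// exactly the per-idle-range cost described above.
}
```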
Our intention has been to handle this kind of partition with failure detectors other than CheckQuorum. For example, the heartbeats in pkg/rpc cover some scenarios.
Perhaps every N minutes, a quiesced range should be woken up?
The replica scanner already does this (at least for the consistency checker mode? I can't remember if any of the other queues are guaranteed to wake a quiesced range).
I am more trying to sketch the idea of "piggy-back on the raft group's ability to survive certain one way partitions" than propose a concrete approach. Does the idea have any merit to it?
Failure detection at the raft level is expensive because there are many raft groups per node. In general, we prefer to move failure detection away from raft. I think it's better to build on node-level constructs like pkg/rpc than range-level ones like raft. Currently the rpc heartbeats only detect partitions in one direction; they need something analogous to CheckQuorum to detect asymmetric problems. Roughly that means "If I haven't received heartbeats from all the nodes I expect in the last X seconds, something's wrong and I should consider giving up my leases". The catch is that the expectations are unclear (in contrast to raft which has well-defined replica sets). In the limit we tend towards a fully-connected RPC graph, but that's not guaranteed (nor, I think, would we want it to be). So we'd need to do something like look at the union of all our raft groups to decide which heartbeats we should worry about.
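A hypothetical sketch of such a receive-side detector (all names invented): record when we last heard from each peer we expect, and if we can no longer account for a majority, consider shedding leases. The hard part Ben mentions, computing the expected set from the union of the node's raft groups, is simply taken as an input here.

```go
// Sketch: a CheckQuorum-like detector built on received RPC heartbeats.
package main

import (
	"fmt"
	"sync"
	"time"
)

type PeerID int32

type receiveDetector struct {
	mu       sync.Mutex
	lastSeen map[PeerID]time.Time // updated whenever a heartbeat *arrives* from a peer
	window   time.Duration
}

func newReceiveDetector(window time.Duration) *receiveDetector {
	return &receiveDetector{lastSeen: make(map[PeerID]time.Time), window: window}
}

// RecordHeartbeatFrom is called by the RPC layer when a heartbeat is received.
func (d *receiveDetector) RecordHeartbeatFrom(p PeerID, now time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.lastSeen[p] = now
}

// QuorumStale reports whether we have recently heard from fewer than a
// majority of ourselves plus the peers we expect to hear from.
func (d *receiveDetector) QuorumStale(expected []PeerID, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	fresh := 1 // count ourselves
	for _, p := range expected {
		if t, ok := d.lastSeen[p]; ok && now.Sub(t) <= d.window {
			fresh++
		}
	}
	return fresh <= (len(expected)+1)/2
}

func main() {
	d := newReceiveDetector(9 * time.Second)
	now := time.Now()
	d.RecordHeartbeatFrom(2, now)
	// Peers 3..6 have not been heard from recently, so this node would
	// consider giving up its leases.
	if d.QuorumStale([]PeerID{2, 3, 4, 5, 6}, now) {
		fmt.Println("heard from too few peers; consider shedding leases")
		// maybeRelinquishLeases() // hypothetical hook
	}
}
```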
That all makes sense. Thanks for sharing, Ben. Seems like a very rich design space, with a lot of possible options. It's interesting that raft is a good failure detector in the concrete sense that we know what we can expect from it, but also raft is expensive given how we use it. It sort of feels to me like there might be options that leverage raft as a failure detector while NOT being as expensive? Some ideas that are prob bad:
- CheckQuorum? Reducing recovery time but also resource usage...
- CheckQuorum enabled? Sounds complicated... but maybe so is the RPC heartbeat approach?

I ack that I don't understand the system in enough detail to ascertain to what extent the RPC heartbeat approach can be tweaked to survive more scenarios...
At an even higher level, another feeling I'm getting is it's okay to not survive certain hardware failures, so long as reliability is good enough overall in the environments we care about. For example, the failure mode described in this ticket has never happened in CC, IIUC. OTOH, CC is admittedly still pretty low scale.
Raft isn't an especially great failure detector (even with CheckQuorum). It works fine, but once you move away from just using raft as-is to having a separate failure detection raft group, there's not much reason to have the separate failure detector be raft based. The hard part is in coordinating the sets of nodes found in each individual raft group with the nodes monitored by the failure detector, and using raft for both doesn't really make that any easier.
For example, the failure mode described in this ticket has never happened in CC, IIUC.
Hasn't happened yet (and even the first message in this thread was about a simulation with iptables and not an actual event), but it could one day. And perhaps more importantly, I think the underlying issue of incomplete failure detection is implicated in a lot of our issues of node-level problems having a larger blast radius than expected.
The hard part is in coordinating the sets of nodes found in each individual raft group with the nodes monitored by the failure detector, and using raft for both doesn't really make that any easier.
Got it.
And perhaps more importantly, I think the underlying issue of incomplete failure detection is implicated in a lot of our issues of node-level problems having a larger blast radius than expected.
That makes a lot of sense also.
A variant of this issue affected another customer yesterday. In this instance, there was an 8 node cluster. Due to a networking issue, nodes 3 and 7 could not communicate with each other, but both nodes could communicate with the rest of the cluster. The debug shows the usual error messages regarding RPC timeouts, being unable to serve requests for ranges, etc. ZD ticket is 14899.
For asymmetric partitions - there is work in progress for 23.1. This doesn't handle all cases of partial partitions, but will handle 1-way partitions much better than today: https://github.com/cockroachdb/cockroach/issues/84289
Please describe the issue you observed, and any steps we can take to reproduce it:
Set up a 6-node cluster (3 in "datacenter" A, 3 in "datacenter" B (172.31.2.0/24)), set the default replication zone to num_replicas = 6, and create a table with num_replicas = 6 as well (a sketch of the corresponding SQL follows the iptables rule below).
Create an iptables rule on a node in A such that it cannot reach any node in B:
sudo iptables -I OUTPUT -d 172.31.2.0/24 -j DROP
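For concreteness, the replication settings from the first step can be applied with CONFIGURE ZONE statements; a minimal Go sketch of that setup, run before adding the iptables rule (the connection string, lib/pq driver, and table name test are assumptions):

```go
// Sketch: apply replication factor 6 to the default zone and the test table.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	// Point this at any node of the cluster before the partition is created.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		`ALTER RANGE default CONFIGURE ZONE USING num_replicas = 6`,
		`CREATE TABLE IF NOT EXISTS test (id INT PRIMARY KEY)`,
		`ALTER TABLE test CONFIGURE ZONE USING num_replicas = 6`,
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("%s: %v", stmt, err)
		}
	}
}
```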
Trying to connect to that node I get:
ERROR: internal error while retrieving user account
Connecting to another node works, but issuing a command like select count(*) from test2; hangs forever. As soon as I dropped that rule with:
sudo iptables -D OUTPUT -d 172.31.2.0/24 -j DROP
the query that was hanging completed, and now I'm able to connect to that node without issues.
Environment:
Additional context: What was the impact? With 1 node isolated from the other 3 nodes, the entire cluster looks like it is not working.
According to the dashboard, all 6 nodes are available.
gz#7690
gz#8203
gz#8949
gz#10844
Jira issue: CRDB-4243
gz#14119
gz#14290
Epic CRDB-25199
gz#16890