Open · kalman5 opened 4 years ago
Hello, I am Blathers. I am here to help you get the issue triaged.
It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.
I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
:owl: Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
cc @RichardJCai
Do you happen to have the logs up until the error is given?
I can repeat the experiment whenever you want; you can find me on Slack as "Gaetano". What bothers me is the entire cluster becoming unavailable.
This is what I see in the log of the isolated node (it can still reach the 2 nodes in the same datacenter).
I'm not able to get an internal error while connecting to the cluster; however, a simple SELECT count(*) FROM test; on my table hangs forever until I remove the firewall rule.
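As an aside, here is a minimal sketch of how the hang can be observed from a Go client (the lib/pq driver, connection string, and table name are assumptions, not from the report): with a context deadline the query fails after 10s instead of blocking until the firewall rule is removed.

```go
// Sketch: bound the hanging query with a client-side deadline.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	// Point this at a node you can still reach.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	var n int
	// Hangs while the range's leaseholder is unreachable; the deadline turns
	// the hang into a visible "context deadline exceeded" error after 10s.
	if err := db.QueryRowContext(ctx, "SELECT count(*) FROM test").Scan(&n); err != nil {
		log.Fatalf("query failed: %v", err)
	}
	log.Printf("count = %d", n)
}
```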
I'm able to reproduce this.
I'm able to reproduce this as well and this seems like a real issue. The problem occurs because at least one range in the table you are querying, test2, has a leaseholder on the partially partitioned node (let's call it P). The gateway you are querying is unable to reach this leaseholder on P, so the query hangs.
The question is - why isn't this leaseholder moving to a different node that is not partially partitioned? At first, I thought this was due to the quorum size of 6. Typically, even-numbered quorums are fragile because they can result in voting ties during leader elections. If all nodes in DC A voted for P and all nodes in DC B voted for one of the other nodes, the election would result in a draw and would need to be run again. However, I tested this out on a 7 node cluster and saw the same thing. That's because we aren't even calling an election for the inaccessible range on P.
The problem here seems to be that we delegate liveness tracking to a separate range in the system ("node liveness"). This NodeLiveness range may have a different set of replicas than the table range. More importantly, it may have a different leaseholder. It turns out that there are two cases we can get into here:
In the case where the node liveness leaseholder is in DC B, the partitioned node P is unable to heartbeat its liveness record. Its lease on the table range then effectively expires, which allows one of the other nodes to call an election and then acquire the lease. The whole thing recovers in about 15s and then we're back to normal.
However, in the case where the node liveness leaseholder is in DC A, things are more tricky. P is still able to heartbeat its liveness record, so it keeps effectively extending its lease over the table range. The NodeLiveness leaseholder doesn't know that P is partitioned, so it happily allows it to continue heartbeating its record. Meanwhile, the data range itself is quiesced and no one is trying to write to it, so none of the replicas are expecting Raft heartbeats. So the leaseholder just stays put and the query hangs while trying to connect to P.
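To make the two-level hierarchy concrete, here is a grossly simplified sketch (the types are invented for illustration, not CockroachDB's actual ones) of why P's epoch-based lease never lapses: validity only checks the liveness record's epoch and expiration, never whether the holder can reach the data range's other replicas.

```go
// Simplified model of an epoch-based lease backed by a node liveness record.
package main

import (
	"fmt"
	"time"
)

type LivenessRecord struct {
	Epoch      int64
	Expiration time.Time // pushed forward by each successful heartbeat
}

type EpochLease struct {
	Holder string
	Epoch  int64 // the holder's liveness epoch at lease acquisition
}

// Valid returns true while the holder's liveness record is unexpired and
// still at the lease's epoch. Note there is no check that the holder can
// actually reach the other replicas of the data range.
func (l EpochLease) Valid(liveness LivenessRecord, now time.Time) bool {
	return liveness.Epoch == l.Epoch && now.Before(liveness.Expiration)
}

func main() {
	now := time.Now()
	// P keeps heartbeating the liveness range in DC A, so its record is fresh.
	pLiveness := LivenessRecord{Epoch: 7, Expiration: now.Add(9 * time.Second)}
	lease := EpochLease{Holder: "P", Epoch: 7}
	fmt.Println(lease.Valid(pLiveness, now)) // true: the lease stays put, the query hangs
}
```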
I'm not actually sure how this class of partial failures is supposed to be handled. It's easy enough to understand in standard Raft, but the two-level liveness hierarchy we have makes it harder to reason about. @bdarnell do you know what we expect to happen here? Should we be restricting the conditions under which we allow a node to heartbeat its liveness record - similar to what we did with nodes that are unable to fsync? We currently use node-liveness as a proxy to determine suitability to hold leases, but it doesn't do a good enough job recognizing partial failures like this, and I'm not sure that it can.
Note that I don't think we have this kind of issue with ranges that use standard expiration-based leases.
Unfortunately we can't handle this like we did with nodes that can't fsync by forbidding liveness heartbeats from nodes that are partitioned. What exactly would make a node consider itself partitioned? Even a single unreachable peer could cause hanging queries if that peer happened to be the gateway. In this case node P can only reach a minority of peers, which seems on its face to be a reasonable threshold, but I think if we require a node to have a majority of its peers live in order to come back online we'll probably have trouble recovering from outages.
Solving network partitions by detecting the partition is a losing game - a sufficiently complex partition can always defy your efforts to figure out what's going on (and in the limit, we're OK with that - we're a CAP-consistent database, not a CAP-available one, and there will be situations in which we're unable to remain available. We just want to handle "reasonable" classes of partitions, for some definition of "reasonable").
We used to have a general mechanism for this, the SendNextTimeout in DistSender. This would cause DistSender to try a different node if the one that it thought was the leaseholder was failing to respond. This was removed in #16088 and some follow-on commits because (IIRC) the additional retries were leading to cascading failures in overloaded clusters, and it was redundant with GRPC-level heartbeats.
Assuming those GRPC-level heartbeats are working, the gateway node should be periodically dropping its connection to P and trying on another replica. But as long as P is still live, it'll just get redirected back to P. We should be using these requests as a trigger to unquiesce the range, allow the other replicas of the range to detect P's failure, and then forcibly invalidate its liveness record.
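A hypothetical sketch of that trigger (none of these names are real CockroachDB APIs): treat a request that keeps failing to reach the cached leaseholder as a hint to unquiesce, let raft detect the missing leader, and only then invalidate its liveness record.

```go
// Sketch of using redirected/failed requests to wake a quiesced range.
package main

import "fmt"

type rangeReplica struct {
	quiesced bool
}

// onRedirectedRequest would be invoked when requests keep bouncing back to an
// unreachable leaseholder via NotLeaseHolderError redirects.
func (r *rangeReplica) onRedirectedRequest(leaseholderReachable bool) {
	if leaseholderReachable {
		return
	}
	if r.quiesced {
		r.quiesced = false // wake the raft group so heartbeats (and elections) resume
	}
	// Once the group is awake, the remaining replicas can notice the missing
	// leader and increment the stuck leaseholder's liveness epoch to
	// invalidate its lease.
	// incrementLeaseholderEpoch() // hypothetical follow-up step
}

func main() {
	r := &rangeReplica{quiesced: true}
	r.onRedirectedRequest(false)
	fmt.Println("quiesced:", r.quiesced) // quiesced: false
}
```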
To be honest, when I did this experiment I just expected CRDB to form a mesh network such that P would reach nodes in the other datacenter via the non-partitioned nodes. A mesh network would be a nice abstraction in CRDB, solving this kind of problem by construction.
CC @tbg
Unfortunately we can't handle this like we did with nodes that can't fsync by forbidding liveness heartbeats from nodes that are partitioned. What exactly would make a node consider itself partitioned? Even a single unreachable peer could cause hanging queries if that peer happened to be the gateway. In this case node P can only reach a minority of peers, which seems on its face to be a reasonable threshold, but I think if we require a node to have a majority of its peers live in order to come back online we'll probably have trouble recovering from outages.
We have an option in between a node not heartbeating itself and the node continuing to hold its leases: the node in question can increment its own epoch, thereby relinquishing all its leases. But the twist is that it'd immediately take back all its leases, except for the range(s) it believes to have an inaccessible quorum. I mean, the node would attempt to take back all the leases, but it wouldn't succeed for the partitioned ranges.
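A hypothetical sketch of that option (made-up names, not real APIs): bump the node's own epoch to drop every lease at once, then let the per-range lease reacquisition fail naturally wherever the node cannot reach a quorum.

```go
// Sketch: shed all leases by self-incrementing the epoch, then reacquire.
package main

import "fmt"

type rangeLease struct {
	rangeID         int64
	quorumReachable func() bool // from this node's point of view
}

type node struct {
	epoch  int64
	leases []rangeLease
}

func (n *node) shedAndReacquireLeases() {
	n.epoch++ // invalidates every lease tied to the old epoch
	kept := n.leases[:0]
	for _, l := range n.leases {
		// Re-acquiring a lease requires committing a lease request through
		// raft, which only succeeds for ranges where this node can still
		// reach a quorum.
		if l.quorumReachable() {
			kept = append(kept, l)
		}
	}
	n.leases = kept
}

func main() {
	n := &node{epoch: 7, leases: []rangeLease{
		{rangeID: 6, quorumReachable: func() bool { return false }}, // the partitioned range
		{rangeID: 9, quorumReachable: func() bool { return true }},
	}}
	n.shedAndReacquireLeases()
	fmt.Println(n.epoch, len(n.leases)) // 8 1: only r9's lease is retained
}
```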
Hi guys, is there an update on this? This is impacting another customer (please see zendesk 8203), I'm not putting their name here as this is a public repo.
No update. I think it's unlikely that there's going to be any movement here soon, cause it seems hard. However, note that this issue is about partial partitions. Zendesk 8203 seems to me to be about a full partition; with full partitions things are supposed to be more clear.
This has affected another customer - please see zendesk 10495.
Zendesk ticket #10484 has been linked to this issue.
Zendesk ticket #10495 has been linked to this issue.
This is a lot less useful without a specific repro, but I saw a somewhat similar outage caused by an asymmetric partition recently that had me wondering why no other node took the affected lease from the partitioned node. It's interestingly different in that the partially partitioned nodes did lose their node liveness but one or more range leases got stuck on them anyway.
Concise-ish series of events:
- failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s: context deadline exceeded, and all of the healthy nodes in the cluster reported a liveness_livenodes metric that was decreased by exactly the number of nodes suffering from connectivity issues
- have been waiting 1m0s attempting to acquire lease for r6
- RangeFeed closed timestamp 1638974354.518526211,0 is behind by 1m9.554085818s
- slow range RPC: have been waiting 62.26s (1 attempts) for RPC [txn: 4dfaefe9], [can-forward-ts], Get [/Table/3/1/11/2/1,/Min) to r6:/Table/{SystemConfigSpan/Start-11}. resp: failed to send RPC: sending to all replicas failed; last error: unable to dial n39: breaker open (where n39 is the node in AZ x that previously held r6's lease)
- NotLeaseHolderError -- e.g. resp: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] r6: replica (n43,s43):1482 not lease holder; current lease is repl=(n39,s39):1479 seq=0 start=0,0 exp=<nil>
My question here is why did no other replica attempt to take the lease? The best guess I can manage with my hazy memory of how things work is that the broken node in AZ x was holding onto raft leadership despite having lost its liveness and that that was somehow preventing other processes from taking the lease on the range. But I'm curious if there's something else going on or if any progress has been made on these sorts of situations since I last knew.
In case it matters, this was on v20.2.
Hi Alex,
that's a good question. r6 has an epoch-based lease, and from what you're describing it looks as though the r6 leaseholder in AZ x couldn't have been maintaining liveness throughout the incident. I was wondering whether https://github.com/cockroachdb/cockroach/blob/e17b557ae254f0536e5bada69e60dd3ff4ff661b/pkg/kv/kvserver/replica_proposal_buf.go#L452-L461 (introduced in #55148) could have to do with it, but that doesn't appear to be present in 20.2 yet. On the nodes that should've obtained the lease, were you seeing any "have been waiting [...] attempting to acquire lease" messages? I assume you're not at liberty to share the logs privately, but if you are able to, I could take a look (maybe `./cockroach debug merge-logs --filter 'r6[^0-9]'` and some more search-replace of keys can do something? Or maybe you have redactable logs, though I'm not even sure that's supported in 20.2).
As for the scenario, if we wanted to reproduce this, would the following seem to represent your situation reasonably well:
Thanks for the response, Tobi!
(introduced in #55148) could have to do with it but that doesn't appear to be present in 20.2 yet.
Sorry, I should have been more specific -- this was on v20.2.13, which does appear to contain that change. So maybe?
On the nodes that should've obtained the lease, were you seeing any "have been waiting [...] attempting to acquire lease" messages?
No. It was only ever logged by the unhealthy node (for r6, that is).
if you are able to, I could take a look
Sure, I'll redact anything needed from the logs and send them over to your email.
As for the scenario, if we wanted to reproduce this, would the following seem to represent your situation reasonably well:
Yeah, although it's hard to say whether there weren't any additional factors at play. If it helps, the p90/p99 latency as self-reported by the round_trip_latency histogram metric was just over 6 seconds. The p50s weren't terrible; it was really only around p80-85 and above where reported latencies from the unhealthy nodes skyrocketed.
I took a look and yes, this is starting to look as though other replicas may have denied themselves the lease based on #55148. I say this because in the logs, we see these messages:
resp: failed to send RPC: sending to all replicas failed; last error: routing information detected to be stale; lastErr: [NotLeaseHolderError] r6: replica (n443,s443):1482 not lease holder; current lease is repl=(n396,s396):1479 seq=0 start=0,0 exp=
This is notably a partially populated lease (start and exp are missing), and this is precisely what you get if you hit this code, which calls down to construct a NotLeaseholderError here. Note the speculativeLease, which has only the Replica but nothing else.
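As a small illustration (simplified stand-in types, not the real roachpb ones), a lease constructed with only the replica filled in prints exactly like the log line above, because sequence, start, and expiration are all zero values:

```go
// Sketch: a "speculative" lease with only the replica populated.
package main

import "fmt"

type lease struct {
	replica string
	seq     int64
	start   float64  // stand-in for the wall-clock start timestamp
	exp     *float64 // nil means no expiration was ever set
}

func main() {
	speculative := lease{replica: "(n39,s39):1479"} // only the replica is known
	fmt.Printf("repl=%s seq=%d start=%v,0 exp=%v\n",
		speculative.replica, speculative.seq, speculative.start, speculative.exp)
	// Output: repl=(n39,s39):1479 seq=0 start=0,0 exp=<nil>
}
```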
Thanks for digging in and confirming. I agree that that looks quite likely.
It seems as though this isn't a situation for which there's a trivial fix, but hopefully knowing about it helps inspire some improvements in the future. I'm happy to chat and bounce ideas back and forth anytime, but I also doubt my ideas will be as helpful or relevant as they may have been in the past :)
Note that I don't think we have this kind of issue with ranges that use standard expiration-based leases.
Based on a discussion with @tbg about the CheckQuorum option being unset on master (https://github.com/cockroachlabs/support/issues/1520#issuecomment-1092105429), I think it's possible our raft groups will NOT survive certain one way network partitions (where the raft leader can send heartbeats but not receive any network traffic). As a result, IIUC, ranges with expiration-based leases also will become unavailable in the face of certain one way network partitions. I've tried to show this is true with https://github.com/cockroachdb/cockroach/pull/79699. It's VERY possible I've not simulated what I meant to simulate, given my lack of experience with KV internals! Still, I think I see different and "better" behavior with the CheckQuorum option set (the raft leader on the partitioned node steps down to follower according to the logs). So maybe I got the repro right enough??? See https://github.com/cockroachdb/cockroach/pull/79699 for more.
Let's assume that with CheckQuorum enabled, our raft groups DO survive one way network partitions that allow the leader to send on the network but not receive. To me, this suggests a way to resolve this ticket here: piggy-back on the raft group's ability to survive these network conditions. Has this idea already been discussed? Sort of follows from what @nvanbenschoten said earlier in this ticket:
It's easy enough to understand in standard Raft
Going back to the beginning of this ticket:
However, in the case where the node liveness leaseholder is in DC A, things are more tricky. P is still able to heartbeat its liveness record, so it keeps effectively extending its lease over the table range. The NodeLiveness leaseholder doesn't know that P is partitioned, so it happily allows it to continue heartbeating its record. Meanwhile, the data range itself is quiesced and no one is trying to write to it, so none of the replicas are expecting Raft heartbeats. So the leaseholder just stays put and the query hangs while trying to connect to P.
With CheckQuorum set, P will step down as raft leader, leaving a log line like this:
4/logs/cockroach.crlMBP-C02CC5FRMD6TMjYz.joshimhoff.2022-04-08T20_18_07Z.094978.log:1044:W220408 20:21:22.520510 346 vendor/go.etcd.io/etcd/raft/v3/raft.go:996 ⋮ [n4,s4,r2/2:‹/System/NodeLiveness{-Max}›] 975 2 stepped down to follower since quorum is not active
We would then do something like this:
When a replica's raft group steps down to follower due to a non-active quorum, the replica should also expire and/or transfer its range lease (assuming it has the range lease).
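A hypothetical sketch of that hook (made-up names, not an existing CockroachDB code path): on the step-down-due-to-inactive-quorum transition, drop the lease rather than letting node liveness keep it alive.

```go
// Sketch: relinquish the range lease when raft steps down for lack of quorum.
package main

import "fmt"

type replica struct {
	rangeID    int64
	holdsLease bool
}

// onRaftStepDown would be wired to the point where we observe raft's
// "stepped down to follower since quorum is not active" transition.
func (r *replica) onRaftStepDown(quorumActive bool) {
	if quorumActive || !r.holdsLease {
		return
	}
	r.holdsLease = false // expire or transfer the lease rather than keep renewing it
	fmt.Printf("r%d: relinquished lease after losing an active quorum\n", r.rangeID)
}

func main() {
	r := &replica{rangeID: 2, holdsLease: true}
	r.onRaftStepDown(false)
}
```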
It's a little unclear to me if ^^ is sufficient. Could the partitioned node re-grab the range lease before being un-partitioned? It can't take raft leadership since it is receiving no network traffic. But I don't understand range leases concretely enough to answer my own Q. @andreimatei says some stuff that sort of sounds related here: https://github.com/cockroachdb/cockroach/issues/49220#issuecomment-756444001
IIUC, one (major?) issue with this suggestion is, since heartbeating is needed to step down after losing quorum:
Meanwhile, the data range itself is quiesced and no-one is trying to write to it, so none of the replicas are expecting Raft heartbeats.
Perhaps every N minutes, a quiesced range should be woken up? We already do this in CC land, since we are KV probing all ranges, regardless of whether they are quiesced. Then the outage would be resolved in at most ~N minutes.
I ack that ^^ is not particularly satisfying...
I am more trying to sketch the idea of "piggy-back on the raft group's ability to survive certain one way partitions" than propose a concrete approach. Does the idea have any merit to it?
Apologies if there are major misunderstandings in what I've written and/or in the linked "repro": https://github.com/cockroachdb/cockroach/pull/79699!!
Based on a discussion with @tbg about the CheckQuorum option being unset on master
The reason CheckQuorum is disabled is that it requires idle leader ranges to track the passage of time, thus raising the cost of an idle range and reducing our data capacity per node. (and because the most common problem solved by CheckQuorum is also solved by PreVote, in a way that is less expensive to us. But that does leave gaps in our coverage for asymmetric network partitions). Simply setting CheckQuorum to true may not have the expected results due to replica quiescence (back before PreVote, we used CheckQuorum with a bunch of additional hacks to make it work, but all that code is gone).
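For reference, this is how those knobs look at the etcd/raft level; a standalone sketch using go.etcd.io/etcd/raft/v3 directly (node ID, tick values, and the single-peer bootstrap are arbitrary, and this is not how CockroachDB wires up raft). The cost Ben mentions comes from the fact that CheckQuorum only does anything if the leader keeps ticking while otherwise idle.

```go
// Sketch: enabling CheckQuorum (and PreVote) on a raw etcd/raft node.
package main

import "go.etcd.io/etcd/raft/v3"

func main() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
		PreVote:         true, // cheaper protection against disruptive rejoining nodes
		CheckQuorum:     true, // leader steps down if it stops hearing from a quorum
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()
	// A real user would now drive n.Tick() from a timer and process n.Ready();
	// CheckQuorum only has an effect if those ticks keep happening, which is
	// exactly the per-idle-range cost described above.
}
```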
Our intention has been to handle this kind of partition with failure detectors other than CheckQuorum. For example, the heartbeats in pkg/rpc cover some scenarios.
Perhaps every N minutes, a quiesced range should be woken up?
The replica scanner already does this (at least for the consistency checker mode? I can't remember if any of the other queues are guaranteed to wake a quiesced range).
I am more trying to sketch the idea of "piggy-back on the raft group's ability to survive certain one way partitions" than propose a concrete approach. Does the idea have any merit to it?
Failure detection at the raft level is expensive because there are many raft groups per node. In general, we prefer to move failure detection away from raft. I think it's better to build on node-level constructs like pkg/rpc than range-level ones like raft. Currently the rpc heartbeats only detect partitions in one direction; they need something analogous to CheckQuorum to detect asymmetric problems. Roughly that means "If I haven't received heartbeats from all the nodes I expect in the last X seconds, something's wrong and I should consider giving up my leases". The catch is that the expectations are unclear (in contrast to raft which has well-defined replica sets). In the limit we tend towards a fully-connected RPC graph, but that's not guaranteed (nor, I think, would we want it to be). So we'd need to do something like look at the union of all our raft groups to decide which heartbeats we should worry about.
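A hypothetical sketch of such a receive-side detector (all names invented): record when we last heard from each peer we expect, and if we can no longer account for a majority, consider shedding leases. The hard part Ben mentions, computing the expected set from the union of the node's raft groups, is simply taken as an input here.

```go
// Sketch: a CheckQuorum-like detector built on received RPC heartbeats.
package main

import (
	"fmt"
	"sync"
	"time"
)

type PeerID int32

type receiveDetector struct {
	mu       sync.Mutex
	lastSeen map[PeerID]time.Time // updated whenever a heartbeat *arrives* from a peer
	window   time.Duration
}

func newReceiveDetector(window time.Duration) *receiveDetector {
	return &receiveDetector{lastSeen: make(map[PeerID]time.Time), window: window}
}

// RecordHeartbeatFrom is called by the RPC layer when a heartbeat is received.
func (d *receiveDetector) RecordHeartbeatFrom(p PeerID, now time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.lastSeen[p] = now
}

// QuorumStale reports whether we have recently heard from fewer than a
// majority of ourselves plus the peers we expect to hear from.
func (d *receiveDetector) QuorumStale(expected []PeerID, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	fresh := 1 // count ourselves
	for _, p := range expected {
		if t, ok := d.lastSeen[p]; ok && now.Sub(t) <= d.window {
			fresh++
		}
	}
	return fresh <= (len(expected)+1)/2
}

func main() {
	d := newReceiveDetector(9 * time.Second)
	now := time.Now()
	d.RecordHeartbeatFrom(2, now)
	// Peers 3..6 have not been heard from recently, so this node would
	// consider giving up its leases.
	if d.QuorumStale([]PeerID{2, 3, 4, 5, 6}, now) {
		fmt.Println("heard from too few peers; consider shedding leases")
		// maybeRelinquishLeases() // hypothetical hook
	}
}
```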
That all makes sense. Thanks for sharing, Ben. Seems like a very rich design space, with a lot of possible options. It's interesting that raft is a good failure detector in the concrete sense that we know what we can expect from it, but also raft is expensive given how we use it. It sort of feels to me like there might be options that leverage raft as a failure detector while NOT being as expensive? Some ideas that are prob bad:
- CheckQuorum? Reducing recovery time but also resource usage...
- CheckQuorum enabled? Sounds complicated... but maybe so is the RPC heartbeat approach?

I ack that I don't understand the system in enough detail to ascertain to what extent the RPC heartbeat approach can be tweaked to survive more scenarios...
At an even higher level, another feeling I'm getting is it's okay to not survive certain hardware failures, so long as reliability is good enough overall in the environments we care about. For example, the failure mode described in this ticket has never happened in CC, IIUC. OTOH, CC is admittedly still pretty low scale.
Raft isn't an especially great failure detector (even with CheckQuorum). It works fine, but once you move away from just using raft as-is to having a separate failure detection raft group, there's not much reason to have the separate failure detector be raft based. The hard part is in coordinating the sets of nodes found in each individual raft group with the nodes monitored by the failure detector, and using raft for both doesn't really make that any easier.
For example, the failure mode described in this ticket has never happened in CC, IIUC.
Hasn't happened yet (and even the first message in this thread was about a simulation with iptables and not an actual event), but it could one day. And perhaps more importantly, I think the underlying issue of incomplete failure detection is implicated in a lot of our issues of node-level problems having a larger blast radius than expected.
The hard part is in coordinating the sets of nodes found in each individual raft group with the nodes monitored by the failure detector, and using raft for both doesn't really make that any easier.
Got it.
And perhaps more importantly, I think the underlying issue of incomplete failure detection is implicated in a lot of our issues of node-level problems having a larger blast radius than expected.
That makes a lot of sense also.
A variant of this issue affected another customer yesterday. In this instance, there was an 8 node cluster. Due to a networking issue, nodes 3 and 7 could not communicate with each other, but both nodes could communicate with the rest of the cluster. The debug shows the usual error messages regarding RPC timeouts, being unable to serve requests for ranges, etc. ZD ticket is 14899.
For asymmetric partitions - there is work in progress for 23.1. This doesn't handle all cases of partial partitions, but will handle 1-way partitions much better than today: https://github.com/cockroachdb/cockroach/issues/84289
Please describe the issue you observed, and any steps we can take to reproduce it:
Set up a 6-node cluster (3 in "datacenter" A, 3 in "datacenter" B (172.31.2.0/24)), set the default replication zone to num_replicas = 6, and create a table with num_replicas = 6 as well (a sketch of the corresponding SQL follows the iptables rule below).
Create an iptables rule on a node in A such that it cannot reach any node in B:
sudo iptables -I OUTPUT -d 172.31.2.0/24 -j DROP
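For concreteness, the replication settings from the first step can be applied with CONFIGURE ZONE statements; a minimal Go sketch of that setup, run before adding the iptables rule (the connection string, lib/pq driver, and table name test are assumptions):

```go
// Sketch: apply replication factor 6 to the default zone and the test table.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	// Point this at any node of the cluster before the partition is created.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		`ALTER RANGE default CONFIGURE ZONE USING num_replicas = 6`,
		`CREATE TABLE IF NOT EXISTS test (id INT PRIMARY KEY)`,
		`ALTER TABLE test CONFIGURE ZONE USING num_replicas = 6`,
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("%s: %v", stmt, err)
		}
	}
}
```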
Trying to connect to that node I get:
ERROR: internal error while retrieving user account
Connecting to another node works, but issuing a command like select count(*) from test2; hangs forever. As soon as I dropped that rule with:
sudo iptables -D OUTPUT -d 172.31.2.0/24 -j DROP
the query that was hanging completed, and now I'm able to connect to that node without issues.
Environment:
Additional context: What was the impact? With 1 node isolated from the other 3 nodes, the entire cluster looks like it is not working.
According to the dashboard, all 6 nodes are available.
gz#7690
gz#8203
gz#8949
gz#10844
Jira issue: CRDB-4243
gz#14119
gz#14290
Epic CRDB-25199
gz#16890