cockroachdb / cockroach

api: make an API endpoint that returns success when a node can be safely shut down #55768

Open piyush-singh opened 4 years ago

piyush-singh commented 4 years ago

Part of Tobi's feedback in #44206 was that we should make it much simpler for users to know when it is safe to take a node offline as part of a rolling restart, rolling upgrade, etc.

We should introduce an HTTP endpoint that users can monitor to know when it is safe to take the next node offline. The endpoint should return a success message when it is safe, otherwise it should return an error. This is a frequent customer request, so we have many users we can discuss this with to make sure our proposed solution works for their automation.

cc @knz

Epic: CRDB-8501

Jira issue: CRDB-3633

knz commented 4 years ago

cc @bdarnell @tbg @andreimatei -- can you outline in a comment here what the 80% solution for this would be?

tbg commented 4 years ago

Copying the relevant part from my comment over on #44206 on what we're trying to avoid:

"take a node offline" is reasonably easy - remove from load balancers; initiate a graceful shutdown; hard-kill after a generous timeout. But "restart the node" is trickier - what to wait for before taking down the next node? Operators would certainly wait for the readiness/health endpoints, but those essentially greenlight the node once it is live again (as measured by node liveness). This is not enough. Consider a range in a cluster that lives on three nodes n1, n2, and n3. n1 is taken offline first; n2 and n3 continue to accept write traffic. Let's say they're going at full speed, meaning that when n1 joins them again it will take it seconds (potentially a dozen or so seconds) until it has caught up on the raft log. Now n1 is restarted and marks itself as live. It begins catching up on the log, but won't be up to date for another 10s or so. n2 gets taken down; now the range is unavailable: we have n3 (which has the latest entries and is now leader) but it can't commit anything until n1 has caught up, which will take at least another couple of seconds. From the operator's point of view, large write latencies are observed on this range.

First, it's worth pointing out that this is a property of restarting a node, not of shutting it down. That is, we can't hope to get an API endpoint on node X that tells us when node X is ready to shut down. Rather, we'll have an API endpoint on the previously restarted node Y that tells us when it's "ready" (i.e. among other things, will pull its weight at the replication layer). That endpoint should probably be health?ready=true, which is also nice - users are likely already watching that.

The "endpoint" @piyush-singh is asking for is then just making sure that all nodes in the cluster show green for health?ready=true. We could wrap that into a cluster-wide endpoint for convenience to abstract away from the per-node level: get the list of nodes in the cluster that are expected to be around (i.e. all non-decommissioned ones) and show green if and only if each node in that list is green for health?ready=true. We can call it a cluster-wide ready. Such an endpoint would also be a natural fit for exposing cluster-wide alerts in the future, but I will leave that part of the API design to others.

For the actual "meat" of the eng work, it seems as though we need to introduce a way to teach health?ready=true to have a notion of "have I caught up on anything I missed since I was last offline".

In some shape or form, this will entail learning, for each replica on the restarted node, at the point of node start, the last index of its range (which only the raft leader knows for sure). Regular raft traffic does not include this information as far as I know, though at least when the node comes back, it will be contacted by the leaders:

https://github.com/cockroachdb/cockroach/blob/f7c95588cdb35649d4c35ccaffe20efad62647f4/pkg/kv/kvserver/store_raft.go#L529-L560

A simple solution would be to include the last index in each raft heartbeat. On each *Replica, we store the index we get on the first heartbeat received by that incarnation. Then, we delay marking the node as ready until all replicas have a last index >= first received leader's last index.
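
A rough sketch of the bookkeeping this implies, using made-up types and names (replicaCatchUp, onHeartbeat, etc. are not CockroachDB identifiers) and assuming raft heartbeats were extended to carry the leader's last index:

```go
package kvreadiness

import "sync"

// replicaCatchUp records, for one replica, the leader's last index as
// reported by the first heartbeat received after this node restarted.
type replicaCatchUp struct {
	mu          sync.Mutex
	targetIndex uint64 // leader's last index from the first heartbeat; 0 if none seen yet
	localIndex  uint64 // last index currently appended to the local raft log
}

// onHeartbeat is called for each raft heartbeat that (hypothetically)
// carries the leader's last index.
func (r *replicaCatchUp) onHeartbeat(leaderLastIndex uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.targetIndex == 0 {
		// Remember only the first index we hear about for this incarnation.
		r.targetIndex = leaderLastIndex
	}
}

// onAppend is called whenever the local raft log advances.
func (r *replicaCatchUp) onAppend(lastIndex uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.localIndex = lastIndex
}

// caughtUp reports whether this replica has reached the index the leader
// advertised when the node first came back up.
func (r *replicaCatchUp) caughtUp() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.targetIndex != 0 && r.localIndex >= r.targetIndex
}

// nodeReady would gate health?ready=true: every tracked replica must be
// caught up (modulo the quiescence edge cases discussed below).
func nodeReady(replicas []*replicaCatchUp) bool {
	for _, r := range replicas {
		if !r.caughtUp() {
			return false
		}
	}
	return true
}
```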

I assume that if we leave it at that, there will be edge cases where the node never marks itself as ready because the leader never unquiesces (or the unquiesce heartbeat gets lost, etc). So at least after some time passes, we will want to be proactive about unquiescing the stragglers.

However, in the hopefully common case, the system will work without wasting any time: the newly restarted node's new liveness record is gossiped; the leaseholders react by unquiescing their ranges, which sends a heartbeat (well, not until the heartbeat timer next fires, so a little time is lost here, but coalesced heartbeats already introduce a buffering delay anyway), and that heartbeat tells the follower which index to wait for; at the same time the follower responds, so the leader starts appending to it again.

An end-to-end test of this would run the above thought experiment: start a three-node cluster, take one node down, leave it down for a while under heavy load on a single range (or a small number of ranges; the point is that we want long, byte-heavy raft logs), bring it back up, and stop another node as soon as the first one declares itself ready. There should be no significant latency spike (assuming inter-node latencies are small).

bdarnell commented 3 years ago

It's important not to think of this as primarily an upgrade-related issue. If at any point it is unsafe to take down a node, your cluster is not meeting its fault tolerance goals because it would be disrupted by a crash that could happen at any time. So whatever we come up with here should be primarily intended to be monitored continuously, and any use of it in the upgrade pipeline is a bonus. I think what's needed here is something like the critical locality report, but at the node level.

can you outline in a comment here what the 80% solution for this would be?

The 80% solution is to do nothing but sleep. Personally I wouldn't bother building something like what's discussed here into an update tool. If, a few minutes after restarting a node, you're not ready to restart another, your other monitoring should have caught the problem.

Restarting a node is a low-risk operation. Even if you restart them too quickly, the cluster will recover on its own. Polling some status endpoint just invites false negatives if you don't wait long enough for the status endpoint to detect that there's a problem.

andreimatei commented 3 years ago

First, it's worth pointing out that this is a property of restarting a node, not of shutting it down. That is, we can't hope to get an API endpoint on node X that tells us when node X is ready to shut down. Rather, we'll have an API endpoint on the previously restarted node Y that tells us when it's "ready" (i.e. among other things, will pull its weight at the replication layer). That endpoint should probably be health?ready=true, which is also nice - users are likely already watching that.

"It" is not a property of restarting a node, I don't think. I think we'd be well advised to discriminate between two things:

So I'm with Ben on the criticality thinking, with the nuance that we should also discriminate between degrees of criticality: a node is really critical if it has one of the last two live replicas for a range. A node is critical-ish if it has one of the two up-to-date replicas of a range, with the 3rd one being alive but behind.
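
To make the distinction concrete, here is a small illustrative sketch (made-up types, not CockroachDB code, and deliberately simplified) of those two degrees of criticality for a single range:

```go
package criticality

// replica describes one replica of a range for the purpose of this sketch.
type replica struct {
	nodeID   int
	live     bool // the node holding this replica is alive
	upToDate bool // the replica has caught up on the raft log
}

type level int

const (
	notCritical level = iota
	criticalIsh       // one of only two up-to-date replicas; the third is live but behind
	critical          // one of only two live replicas
)

// nodeCriticality classifies how critical nodeID is for a single range.
func nodeCriticality(nodeID int, replicas []replica) level {
	var live, upToDate int
	var holdsLive, holdsUpToDate bool
	for _, r := range replicas {
		if r.live {
			live++
			if r.nodeID == nodeID {
				holdsLive = true
			}
		}
		if r.upToDate {
			upToDate++
			if r.nodeID == nodeID {
				holdsUpToDate = true
			}
		}
	}
	switch {
	case holdsLive && live <= 2:
		return critical
	case holdsUpToDate && upToDate <= 2 && live == len(replicas):
		return criticalIsh
	default:
		return notCritical
	}
}
```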

The 80% solution is to do nothing but sleep. Personally I wouldn't bother building something like what's discussed here into an update tool. If, a few minutes after restarting a node, you're not ready to restart another, your other monitoring should have caught the problem.

Well, it would be nice to have tooling that generally lets you go through restarts much faster than a node every "few minutes".

knz commented 3 years ago

Well, it would be nice to have tooling that generally lets you go through restarts much faster than a node every "few minutes".

Agreed with Andrei here. The entire discussion started because users are asking "how much waiting is enough?"

For some of them, waiting a minute or less is OK. For others (many ranges and/or lots of activity, plus a restart taking more than a few minutes, so there's a lot to catch up on), 10 minutes is a minimum. We can't just tell everyone "wait 10 minutes".

(I feel this is more or less the "drain" discussion all over again, but on the other side of a restart.)

bdarnell commented 3 years ago

It is closely related to the draining discussion, and I tend to come to the same conclusions (be patient and don't overcomplicate. fixed sleeps are better than clever detection logic. rely on the same monitoring you use outside of upgrades).

First, let me clarify that we should absolutely be relying on the /health?ready=1 endpoint of the node that was just restarted. Until that is returning OK, the restart of the previous node isn't complete.

But this issue is asking for a new endpoint which tells you something different: after the previous restart is complete, how soon can you restart the next node. That's where I think it's getting too complicated and a fixed rate is better.

It's true that relying on the implicit ability to withstand a single node crash at any time prevents you from parallelizing your restarts, so you need to serially wait for drain time and restart time on every node. In very large clusters you'd want to parallelize this, but I think that's a much harder problem (or a much easier one - just drain traffic from an AZ and upgrade the whole thing at once?)

knz commented 3 years ago

@piyush-singh can you remind us who the internal customer is here? Is this important for CC? If so can someone from the CC team chime in about what is needed, and their reaction to Ben's comments?

joshimhoff commented 3 years ago

I think I agree with Ben here. If all nodes are live (which can be checked by sending /health?ready=1 to all nodes), and if there are fixed sleeps of ~3m between restarts (this is how the CC automation works), then it is most likely okay to restart the next node. If it's not okay, we should be able to detect that after a short blip in availability with monitoring (e.g. maybe with a kvprober that does writes).

At that point, we can prioritize more work to protect against the raft-group-behind failure mode, but until then we don't really have any evidence that this failure mode actually leads to customer pain. Right? Unless we actually do have evidence that it has hit us in some production environment? Anyone? Graceful drain may have issues, but do we really think this failure mode is the culprit, as opposed to issues like https://github.com/cockroachdb/cockroach/issues/66103, or even more basic issues like simply not sending SIGTERM or not waiting long enough for drain?
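
For concreteness, the procedure described above boils down to something like the following sketch, where restartNode and clusterReady are placeholders for real orchestration and for a cluster-wide /health?ready=1 check like the one sketched earlier (this is not the actual CC automation code):

```go
package restart

import "time"

// rollingRestart restarts nodes one at a time: restart a node, poll until
// every node reports ready again, then sleep a fixed ~3 minutes before
// touching the next node. Both callbacks are stand-ins for real machinery
// (Kubernetes, systemd, a health-check aggregator, ...).
func rollingRestart(
	addrs []string,
	restartNode func(addr string) error,
	clusterReady func() bool,
) error {
	for _, a := range addrs {
		if err := restartNode(a); err != nil {
			return err
		}
		for !clusterReady() {
			time.Sleep(5 * time.Second) // wait for all nodes to report ready
		}
		time.Sleep(3 * time.Minute) // fixed pause before the next restart
	}
	return nil
}
```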

Are there other benefits to the proposed endpoint? Or is the one goal to protect against the raft-group-behind failure mode?

My initial feeling is the endpoint has unclear value to CC but I think a little more discussion will help figure this out.

joshimhoff commented 3 years ago

Well, it would be nice to have tooling that generally lets you go through restarts much faster than a node every "few minutes".

I guess this is one additional benefit. I think it's fair to say that upgrade runtime is not yet a pain point on the CC side, but that's just one data point.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!