elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Make snapshots more resilient to node departure #71333

Open DaveCTurner opened 3 years ago

DaveCTurner commented 3 years ago

If a node holding a primary shard leaves the cluster then one of the replica shards is immediately promoted to primary to replace the failed copy. Today, if a snapshot is ongoing when the promotion happens then the corresponding shard-level snapshot fails and the overall snapshot status is at best PARTIAL. This is a problem for graceful shutdowns (https://github.com/elastic/elasticsearch/issues/70338), which ideally would not result in any such failures. In cases where a replica is promoted to replace a failed primary it would be better to retry the shard-level snapshot on the new primary instead.
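To make the desired behaviour concrete, here is a minimal sketch (the types and names are hypothetical, not the actual SnapshotsService logic) contrasting today's fail-on-departure handling with the proposed retry-on-the-new-primary handling:

```java
// Hypothetical, heavily simplified sketch of the desired behaviour; the real logic
// lives in SnapshotsService and is considerably more involved.
import java.util.Map;
import java.util.Set;

enum ShardSnapshotState { INIT, SUCCESS, FAILED, WAITING }

record ShardSnapshot(String shardId, String assignedNodeId, ShardSnapshotState state) {

    // Today: the shard-level snapshot fails when the node holding it leaves the cluster.
    ShardSnapshot onNodeLeftToday(Set<String> liveNodes) {
        if (state == ShardSnapshotState.INIT && liveNodes.contains(assignedNodeId) == false) {
            return new ShardSnapshot(shardId, assignedNodeId, ShardSnapshotState.FAILED);
        }
        return this;
    }

    // Proposed: retry on the newly promoted primary instead of failing outright.
    ShardSnapshot onNodeLeftProposed(Set<String> liveNodes, Map<String, String> newPrimaryNodeByShard) {
        if (state == ShardSnapshotState.INIT && liveNodes.contains(assignedNodeId) == false) {
            String newPrimary = newPrimaryNodeByShard.get(shardId);
            return newPrimary == null
                ? new ShardSnapshot(shardId, assignedNodeId, ShardSnapshotState.WAITING) // no promoted copy yet
                : new ShardSnapshot(shardId, newPrimary, ShardSnapshotState.INIT);       // retry on the new primary
        }
        return this;
    }
}
```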

This isn't the first time this idea has come up:

https://github.com/elastic/elasticsearch/blob/9addf0b6ed0ca06b67e641ccbc0b6525d418f5ee/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java#L1091

elasticmachine commented 3 years ago

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear commented 3 years ago

I think theoretically there aren't many issues with this for repositories of meta version 7.6+ (that use uuid shard generations and thus don't have any troublesome races if two data nodes write to the same shard).
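For anyone unfamiliar with why UUID shard generations matter here, a toy illustration (the blob naming below is simplified, not the exact repository layout): with numeric generations two racing writers can compute the same "next" blob name and overwrite each other, whereas UUID generations give every writer a distinct name, so a stale writer at worst leaves an unused blob behind.

```java
// Illustrative sketch only, not the actual repository implementation.
import java.util.UUID;

final class ShardGenerations {

    // Pre-7.6-style numeric generation: two writers racing from the same
    // starting generation both derive the same blob name.
    static String nextNumericGeneration(long currentGeneration) {
        return "index-" + (currentGeneration + 1);
    }

    // 7.6+-style UUID generation: every writer produces a distinct blob name.
    static String nextUuidGeneration() {
        return "index-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(nextNumericGeneration(41)); // index-42
        System.out.println(nextNumericGeneration(41)); // index-42 -> same blob, possible overwrite
        System.out.println(nextUuidGeneration());      // distinct
        System.out.println(nextUuidGeneration());      // distinct
    }
}
```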

There's definitely some practical complexities to this though (e.g. SLM running snapshots at a high frequency, where this now causes snapshots to slow down significantly if a hot index fails over, since the snapshot may not be incremental at all due to the changed file structure). I guess we can discuss those elsewhere, but in general we should definitely try to do this IMO. It would make the behavior of snapshots with large shard counts a lot nicer.

DaveCTurner commented 3 years ago

There's definitely some practical complexities to this though

Aren't there always? :grin:

Doesn't the hot-index-failing-over case already affect us today? Its impact is currently on the next snapshot to be taken so this would just move it forward a bit. IMO a bigger concern is a snapshot that simply never ends because one of the primaries keeps on moving around. Certainly solvable, but not something that we need to worry about today.

original-brownbear commented 3 years ago

a bigger concern is a snapshot that simply never ends because one of the primaries keeps on moving around

That's the even worse version of this for sure (worst case the snapshot would be the reason for the shard failure I guess ...).

Doesn't the hot-index-failing-over case already affect us today? Its impact is currently on the next snapshot to be taken so this would just move it forward a bit.

It certainly does. I think the "moving forward" is a good thing though. If anything, it at least takes the snapshot load off the cluster at the moment a data node fails, when shards may be relocating and putting extra pressure on things. If we start retrying, a primary failover will trigger more work on the data node right away and (if SLM is running in a tight loop) additional snapshots piling up. It's hard to quantify the impact of this, but it seems like it would make things worse for a lot of clusters that are already on the edge. As always ... none of this is blocking. We could certainly make big improvements to the way we track state for multiple snapshots and how we execute them to make this much less of a concern.

DaveCTurner commented 3 years ago

We discussed this in the @elastic/es-distributed team sync today.

We hadn't implemented this already because it risked corrupting the repo, but in newer versions that's no longer true so we can safely implement it. We did not reach a conclusion on exactly when to retry and when to give up. There is a trade-off between bounding the duration of the snapshot vs avoiding failure states in a graceful shutdown.

One possibility would be to have the retry behaviour be sensitive to which nodes are gracefully shutting down.

DaveCTurner commented 3 years ago

We (the @elastic/es-distributed team) discussed this again today. We think retrying finitely often (once by default) per node (identified by persistent node ID) would be sufficient protection against snapshots that just never finish.
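To illustrate what "retrying finitely often per node" could look like, here is a sketch with made-up names, under the assumption that retries are counted against the departed node's persistent ID:

```java
// A minimal sketch (hypothetical, not Elasticsearch code) of the per-node retry
// budget: count shard-snapshot retries against the persistent ID of the node that
// departed, and give up once a node has exhausted its budget so a primary that
// keeps bouncing around cannot keep a snapshot alive forever.
import java.util.HashMap;
import java.util.Map;

final class PerNodeRetryBudget {
    private final int maxRetriesPerNode;
    private final Map<String, Integer> retriesByNodeId = new HashMap<>();

    PerNodeRetryBudget(int maxRetriesPerNode) { // e.g. 1 by default, as discussed above
        this.maxRetriesPerNode = maxRetriesPerNode;
    }

    /** Returns true if we may retry the shard snapshot after {@code nodeId} departed. */
    synchronized boolean tryConsumeRetry(String nodeId) {
        int used = retriesByNodeId.getOrDefault(nodeId, 0);
        if (used >= maxRetriesPerNode) {
            return false; // budget exhausted: fail the shard snapshot instead
        }
        retriesByNodeId.put(nodeId, used + 1);
        return true;
    }
}
```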

In addition we discussed a way to abort a snapshot without losing all the work it already did and opened https://github.com/elastic/elasticsearch/issues/75340 to pursue that idea further.

DaveCTurner commented 1 year ago

This plan turns out to be pretty tricky to fit into the snapshots state machine as it stands today. The assumption that shards don't move while being snapshotted is implicit in several places deep within the convoluted logic of the SnapshotsService, and I am not confident we will find them all. I'm going to try a different approach, integrating things more closely with graceful node shutdown and avoiding the need to handle other kinds of node departure:


Abort shard snapshots on shutting-down nodes

Adjust the SnapshotShardsService to notice that the local node has a shutdown marker and treat that as if all its ongoing shard snapshots are aborted. When each one completes, respond indicating it should move back to state WAITING rather than going to FAILED.

I suspect this won't work (or at least it will break some invariants) in older versions, so if the min cluster version is too old then skip this logic.

Stop moving shards from WAITING/QUEUED to INIT state on shutting-down nodes

Adjust the SnapshotsService to treat primaries assigned to shutting-down nodes as if they were unassigned.


I think with these two changes we will react well to graceful shutdowns, stopping shard snapshot work ASAP without causing a PARTIAL snapshot, permitting the shards to relocate elsewhere, and resuming the shard snapshot work on more appropriate nodes.
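A rough sketch of how those two changes fit together, using made-up types rather than the real SnapshotShardsService/SnapshotsService APIs:

```java
// Hypothetical sketch of the two changes described above, not actual Elasticsearch code.
import java.util.Set;

enum State { QUEUED, WAITING, INIT, SUCCESS, FAILED }

final class GracefulShutdownSnapshotLogic {

    // Change 1: on the data node, treat a local shutdown marker like an abort of each
    // ongoing shard snapshot, but report WAITING (so the work can resume elsewhere)
    // rather than FAILED.
    static State onLocalShardSnapshotStopped(boolean nodeShuttingDown, boolean aborted) {
        if (nodeShuttingDown) {
            return State.WAITING;
        }
        return aborted ? State.FAILED : State.SUCCESS;
    }

    // Change 2: on the master, never move a shard snapshot to INIT on a shutting-down
    // node; treat primaries assigned to such nodes as if they were unassigned.
    static State assignShardSnapshot(String primaryNodeId, Set<String> shuttingDownNodes) {
        if (primaryNodeId == null || shuttingDownNodes.contains(primaryNodeId)) {
            return State.WAITING;
        }
        return State.INIT;
    }
}
```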

I don't think we need to put extra effort into limiting the retries with this plan, as suggested in an earlier comment. As long as the node shutdown markers aren't flapping and the shards eventually relocate or the node restarts, each snapshot will complete.

If we cannot relocate a shard onto a new node then the snapshot will also never complete. However I believe that we treat this as a failure case in other ways too, eventually timing out and either stopping the node or else removing the shutdown marker, either of which permits the snapshot to proceed.


ETA: I've put this comment on the wrong issue. This is a solution for #101171 which avoids the need to solve #71333 at the same time.

fcofdez commented 1 year ago

I think with these two changes we will react well to graceful shutdowns, stopping shard snapshot work ASAP without causing a PARTIAL snapshot, permitting the shards to relocate elsewhere, and resuming the shard snapshot work on more appropriate nodes.

You mentioned that the assumption that shards don't move while being snapshotted is implicit in several places; I wonder if the state transition from INIT back to WAITING (instead of FAILED) is also implicitly forbidden somewhere. Otherwise this approach makes sense to me.

DaveCTurner commented 1 year ago

I wonder if the state transition from INIT back to WAITING instead of FAILED is also implicitly forbidden.

Yes, I think it is, although my first attempts to go in the direction I suggest indicate that this assumption is not quite so deeply embedded in the logic.

henningandersen commented 1 year ago

This approach makes sense to me.

DaveCTurner commented 1 year ago

Moving the shards back to WAITING on abort seems to be no problem after all, which is nice :)

https://github.com/elastic/elasticsearch/compare/main...DaveCTurner:elasticsearch:2023/10/30/pause-snapshot-on-shutdown

ywangd commented 1 year ago

Thanks for the code diff, which gives me a more concrete understanding of the proposal. I think it works in the context of node shutdown. I do have some questions about your earlier comment and about how this proposal compares to the broader node-departure scenarios.

The assumption that shards don't move while being snapshotted is implicit in several places deep within the convoluted logic of the SnapshotsService

Could you please share one or a few examples of this entanglement?

integrating things more closely with graceful node shutdown and avoiding the need to handle other kinds of node departure

This suggests node shutdown is easier to handle compared to the more generic node departure. I think this makes sense intuitively, but I cannot quite articulate why. Is it because we can preemptively react to the shutdown request and can assume (and rely on?) the shards coming back eventually, whereas for a general node-departure scenario it is unknown whether the shard would come back, which leads to the "finite number of retries" issue? It seems moving the shard back to the WAITING state should make no difference for either node shutdown or node departure?

I suspect this won't work (or at least it will break some invariants) in older versions, so if the min cluster version is too old then skip this logic.

Why won't this work with old versions? Is it because of UUID shard generations, i.e. old versions that do not use UUID shard generations could overwrite each other?

If we cannot relocate a shard onto a new node then the snapshot will also never complete. However I believe that we treat this as a failure case in other ways too, eventually timing out and either stopping the node or else removing the shutdown marker, either of which permits the snapshot to proceed.

How do we "eventually time out"? Will this be new code? I am not quite sure how "stopping the node" can permit the snapshot to proceed. To allow the snapshot to resume when "removing the shutdown marker", the change needs to be made in SnapshotsService, right? Essentially a corollary of "Stop moving shards from WAITING/QUEUED to INIT state on shutting-down nodes"?

Apologies if some of the questions are naive. Still trying to get my head around this area. Thanks!

DaveCTurner commented 1 year ago

Moved the preceding conversation over to https://github.com/elastic/elasticsearch/issues/101171 where it is a little more relevant.