SLM Handles Master Failovers Poorly (Retries Cause Confusing Situations)

original-brownbear commented 4 years ago

Since SLM starts snapshots through the Client instead of using SnapshotsService directly, we see some confusing and unwanted behavior on master fail-over events in practice.

For example, if the current master fails over during snapshot creation, the following happens:

1. The snapshot create action retries internally (TransportMasterNodeAction does that).
2. The snapshot is then executed on the new master (as a result of the create request from the previous master).
3. The snapshot succeeds, but the previous master fails to record this because it is no longer the master (and hence cannot run the cluster state update task that records the successful snapshot).
4. Meanwhile, the new master could concurrently try to run another snapshot (or delete), but that will fail because it is already executing the retry from the previous master.
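
The sequence above can be sketched as a small, self-contained toy model. This is only an illustration under stated assumptions: `Node`, `Cluster`, `startSnapshot`, `recordSuccess` and the snapshot names are hypothetical stand-ins invented for this sketch, not Elasticsearch classes or APIs, and the "retry" is modeled as a plain method call rather than what TransportMasterNodeAction actually does.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the fail-over sequence; none of these types exist in Elasticsearch. */
public class SlmFailoverSketch {

    /** Hypothetical node that may or may not be the elected master. */
    static final class Node {
        final String name;
        boolean master;
        Node(String name, boolean master) { this.name = name; this.master = master; }
    }

    /** Hypothetical cluster that allows only one snapshot operation at a time. */
    static final class Cluster {
        final List<Node> nodes = new ArrayList<>();
        final List<String> recordedSuccesses = new ArrayList<>();
        String runningSnapshot; // null when no snapshot is in progress

        Node currentMaster() {
            for (Node n : nodes) {
                if (n.master) return n;
            }
            throw new IllegalStateException("no master elected");
        }

        void failOver(Node from, Node to) { from.master = false; to.master = true; }

        /** Returns false if another snapshot is already running. */
        boolean startSnapshot(String snapshot) {
            if (runningSnapshot != null) return false;
            runningSnapshot = snapshot;
            return true;
        }

        /** Stand-in for the cluster state update; only the current master may run it. */
        boolean recordSuccess(Node requester, String snapshot) {
            if (!requester.master) return false;
            recordedSuccesses.add(snapshot);
            return true;
        }
    }

    static List<String> simulate() {
        List<String> events = new ArrayList<>();
        Cluster cluster = new Cluster();
        Node a = new Node("A", true);
        Node b = new Node("B", false);
        cluster.nodes.add(a);
        cluster.nodes.add(b);

        // A's SLM task issues a create request, then A loses mastership mid-flight.
        cluster.failOver(a, b);

        // Steps 1 and 2: the request is retried and now runs under the new master B.
        boolean retried = cluster.startSnapshot("snap-from-A");
        events.add("retry on " + cluster.currentMaster().name + " started=" + retried);

        // Step 4: B's own SLM task is rejected because the retry is still running.
        boolean own = cluster.startSnapshot("snap-from-B");
        events.add("B own snapshot started=" + own);

        // Step 3: the snapshot finishes, but A can no longer record the success.
        cluster.runningSnapshot = null;
        boolean recorded = cluster.recordSuccess(a, "snap-from-A");
        events.add("A recorded success=" + recorded);
        return events;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In this model the snapshot itself completes, yet nothing ends up in `recordedSuccesses`, and the new master's own request was rejected in the meantime.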

There are other strange spots around using the Client to orchestrate SLM actions. Another example is deletes: a master node that has failed over could go through the whole retention process without failing, since all its actions through the Client will still work out. This causes the previous master to block the actual SLM tasks on the new master for a potentially long time via a delete loop over multiple snapshots.

=> I think we must stop using the Client to orchestrate SLM's actions (it will always retry things on master fail-over due to the way TransportMasterNodeAction works) and move to using the SnapshotsService directly, making sure that we instantly bail out of SLM tasks if we encounter any master-failover type of exception at any step in an SLM lifecycle or retention task.
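
A minimal sketch of the "bail out instantly" idea for a retention-style delete loop, under stated assumptions: `MasterFailoverException` is a hypothetical local class standing in for whatever master-failover type of exception the real code would see, and `runRetention` and the delete step are invented for illustration. It is not how SLM or SnapshotsService is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Toy retention loop that stops at the first sign of a master fail-over. */
public class SlmBailOutSketch {

    /** Hypothetical marker for "this node is no longer the master". */
    static final class MasterFailoverException extends RuntimeException {
        MasterFailoverException(String message) { super(message); }
    }

    /**
     * Applies deleteStep to each snapshot in order and returns the names that
     * were deleted. On a MasterFailoverException the loop ends immediately, so
     * the remaining snapshots are left for the new master's own retention run.
     */
    static List<String> runRetention(List<String> snapshots, Consumer<String> deleteStep) {
        List<String> deleted = new ArrayList<>();
        for (String snapshot : snapshots) {
            try {
                deleteStep.accept(snapshot);
                deleted.add(snapshot);
            } catch (MasterFailoverException e) {
                // Do not retry and do not continue: this node lost mastership.
                break;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        // Simulate losing mastership while deleting the second snapshot.
        List<String> deleted = runRetention(Arrays.asList("s1", "s2", "s3"), s -> {
            if (s.equals("s2")) {
                throw new MasterFailoverException("no longer master");
            }
        });
        System.out.println(deleted); // prints [s1]
    }
}
```

The design choice here is that a fail-over is treated as terminal for the task rather than as a transient error, which is the opposite of the retry behavior described above.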

elasticmachine commented 4 years ago

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

dakrone commented 4 months ago

This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.

elastic / elasticsearch

SLM Handles Master Failovers Poorly (Retries Cause Confusing Situations) #56328