Closed original-brownbear closed 4 months ago
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.
Since SLM uses the `Client` instead of using `SnapshotsService` directly to start snapshots, we see some confusing and unwanted behavior on master fail-over events in practice. If the current master fails over during snapshot creation, then the following will happen, for example:
`TransportMasterNodeAction` does that)

There are other strange spots around using the `Client` to orchestrate SLM actions. Another example would be deletes: a master node that failed over could go through the whole retention process without failing, since all its actions through the `Client` will still work out (causing the previous master to block the actual SLM tasks on the new master for a potentially long time via a delete loop over multiple snapshots).

=> I think we must stop using the `Client` (it will always retry things on master fail-over due to the way `TransportMasterNodeAction` works) to orchestrate SLM actions and move to using the `SnapshotsService` directly, making sure that we instantly bail out of SLM tasks if we encounter any master-failover type of exception at any step in an SLM lifecycle or retention task.
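
To make the proposed "bail out instantly" behavior concrete, here is a minimal, self-contained sketch of a retention delete loop that stops at the first master-failover exception instead of continuing. Everything in it is hypothetical and for illustration only: `MasterFailoverException`, `SnapshotDeleter`, and `runRetention` are stand-ins, not Elasticsearch classes or the real `SnapshotsService` API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Elasticsearch classes.
class SlmBailOutSketch {

    /** Hypothetical marker for "this node is no longer the elected master". */
    static class MasterFailoverException extends RuntimeException {
        MasterFailoverException(String message) {
            super(message);
        }
    }

    /** Hypothetical direct, non-retrying delete call (assumed service-level entry point). */
    interface SnapshotDeleter {
        void delete(String snapshotName);
    }

    /**
     * Deletes expired snapshots one by one and returns the names that were deleted.
     * Stops at the first master-failover exception rather than moving on to the
     * remaining snapshots.
     */
    static List<String> runRetention(List<String> expired, SnapshotDeleter deleter) {
        List<String> deleted = new ArrayList<>();
        for (String name : expired) {
            try {
                deleter.delete(name);
                deleted.add(name);
            } catch (MasterFailoverException e) {
                // Bail out immediately: retention is now the new master's job.
                break;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> expired = List.of("snap-1", "snap-2", "snap-3");
        // Simulate losing mastership just before the second delete.
        SnapshotDeleter deleter = name -> {
            if (name.equals("snap-2")) {
                throw new MasterFailoverException("no longer master");
            }
        };
        System.out.println(runRetention(expired, deleter)); // prints [snap-1]
    }
}
```

In this sketch the old master never attempts `snap-3`, which is the point: the loop gives up as soon as it learns it is no longer master, rather than working through the whole list and holding up the new master's SLM tasks.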