elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

SLM Handles Master Failovers Poorly (Retries Cause Confusing Situations) #56328

Closed original-brownbear closed 4 months ago

original-brownbear commented 4 years ago

Since SLM uses the Client instead of using SnapshotsService directly to start snapshots, we see some confusing and unwanted behavior on master fail-over events in practice.

If the current master fails over during snapshot creation then the following will happen for example:

There are other strange spots around using the Client to orchestrate SLM actions. Another example would be deletes: a master node that failed over could go through the whole retention process without failing, since all its actions through the Client will still succeed. This causes the previous master to block the actual SLM tasks on the new master for a potentially long time via a delete loop over multiple snapshots.

=> I think we must stop using the Client to orchestrate SLM's actions (it will always retry things on master fail-over due to the way TransportMasterNodeAction works) and move to using the SnapshotsService directly, making sure that we instantly bail out of SLM tasks if we encounter any master-failover type of exception at any step in an SLM lifecycle or retention task.
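To make the "bail out instantly" idea concrete, here is a minimal sketch of a retention loop that stops at the first master-failover error instead of retrying. Everything in it is hypothetical: `SlmRetentionSketch`, `NotMasterException`, and `SnapshotDeleter` are stand-ins invented for illustration, not Elasticsearch classes, and the real SnapshotsService API is asynchronous and looks different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are Elasticsearch classes.
final class SlmRetentionSketch {

    /** Stand-in for any "this node is no longer master" failure. */
    static final class NotMasterException extends RuntimeException {
        NotMasterException(String message) {
            super(message);
        }
    }

    /** Stand-in for a direct, non-retrying call into the snapshot service. */
    interface SnapshotDeleter {
        void delete(String snapshotName);
    }

    /**
     * Deletes expired snapshots one by one and stops at the first
     * master-failover error instead of retrying.
     * Returns the names that were actually deleted.
     */
    static List<String> runRetention(List<String> expired, SnapshotDeleter deleter) {
        List<String> deleted = new ArrayList<>();
        for (String name : expired) {
            try {
                deleter.delete(name);
                deleted.add(name);
            } catch (NotMasterException e) {
                // Bail out immediately: the new master is responsible for
                // retention now, so the remaining deletes are not attempted.
                break;
            }
        }
        return deleted;
    }
}
```

With a deleter that throws `NotMasterException` on the second snapshot, `runRetention` returns only the first name and never attempts the third, which is the behavior the failed-over master should have in place of the delete loop described above.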

elasticmachine commented 4 years ago

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

dakrone commented 4 months ago

This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.