elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Allow snapshot restore after write alias has been moved by ILM #73934

Open · matschaffer opened this issue 3 years ago

matschaffer commented 3 years ago

I've seen some cases where a snapshot restore has failed with an error like this:

[illegal_state_exception] alias [matschaffer-filebeat-7.7.1] has more than one write index [matschaffer-filebeat-7.7.1-2021.03.22-000096,matschaffer-filebeat-7.7.1-2021.03.21-000095]

The sequence of events is roughly:

  1. Data is being written to matschaffer-filebeat-7.7.1-2021.03.21-000095 via matschaffer-filebeat-7.7.1 write alias
  2. A snapshot is taken which backs up matschaffer-filebeat-7.7.1-2021.03.21-000095 with the alias information
  3. ILM rolls over matschaffer-filebeat-7.7.1-2021.03.21-000095 to matschaffer-filebeat-7.7.1-2021.03.21-000096 and updates the write alias
  4. A failure occurs and matschaffer-filebeat-7.7.1-2021.03.21-000095 is lost
  5. Restore of matschaffer-filebeat-7.7.1-2021.03.21-000095 fails because the restore also attempts to claim the matschaffer-filebeat-7.7.1 write alias, which is currently backed by matschaffer-filebeat-7.7.1-2021.03.21-000096

To work around this I had to perform the restore manually without aliases:

POST _snapshot/found-snapshots/cloud-snapshot-2021.03.22-UUID/_restore
{
    "indices": [
        "matschaffer-filebeat-7.7.1-2021.03.21-000095"
    ],
    "include_aliases": false
}

Then re-add the read alias so the restored data is available to normal query traffic:

POST _aliases
{
    "actions" : [
        { "add" : { "index" : "matschaffer-filebeat-7.7.1-2021.03.21-000095", "alias" : "matschaffer-filebeat-7.7.1", "is_write_index": false } }
    ]
}

It'd be great if restore could be more ILM-aware, so that it doesn't try to re-claim a write alias that is already backed by a more current index.

elasticmachine commented 3 years ago

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner commented 3 years ago

We (the @elastic/es-distributed team) discussed possible solutions in our team meeting today. Our favourite idea was to introduce a new option that would let you preserve the aliases of an existing index rather than overwriting them or clearing them as we do today. The reasoning was that when restoring an index like this you're really trying to put its data back without changing its place in the cluster, so the aliases of the existing index are likely more useful than the aliases in the snapshot.

We discussed changing the default behaviour but decided it'd be surprising for the API to behave differently from today by default. Instead we would expect tooling that restores indices like this to use this new option explicitly.

We also discussed whether to preserve any other metadata (mappings, settings, ...) rather than overwriting them from those in the snapshot but decided that there are too many ways that such a mechanism might lead to operational surprises.

How does that sound @matschaffer?

matschaffer commented 3 years ago

Hard to say without a little more detail.

My expectation would be that you have some ability to restore matschaffer-filebeat-7.7.1-2021.03.21-000095 with only the read alias, leaving the write alias pointed at matschaffer-filebeat-7.7.1-2021.03.21-000096, in contrast to today, where you get either read+write or nothing (via include_aliases: false).

If the new option would do this, then that's probably fine. It'd be good if we made this the default in Kibana's restore UI, or maybe even in Elasticsearch itself.

We see this with some frequency when orchestrating snapshot restore after VM failure on non-HA indices.
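For illustration, the end state described above, with the restored index carrying only the read side of the alias, could be verified with an alias lookup roughly like the following sketch (names as used in this thread; the response is abbreviated and illustrative):

GET matschaffer-filebeat-7.7.1-*/_alias/matschaffer-filebeat-7.7.1

A response along these lines would indicate the desired state, with only the current index holding the write flag:

{
    "matschaffer-filebeat-7.7.1-2021.03.21-000095": {
        "aliases": {
            "matschaffer-filebeat-7.7.1": { "is_write_index": false }
        }
    },
    "matschaffer-filebeat-7.7.1-2021.03.21-000096": {
        "aliases": {
            "matschaffer-filebeat-7.7.1": { "is_write_index": true }
        }
    }
}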

DaveCTurner commented 3 years ago

On closer inspection, it seems that include_aliases: false already does what we propose: it preserves the aliases of the existing closed index over the top of which we're doing the restore. However, the orchestration tooling isn't setting this option, so its restores will often fail as described. I believe we should always use include_aliases: false when restoring an index to recover it from some misadventure that left it in red health.
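A sketch of that recommended flow, reusing the snapshot and index names from earlier in the thread (when the damaged index still exists in the cluster, it must be closed before restoring over it; the names here are illustrative):

POST matschaffer-filebeat-7.7.1-2021.03.21-000095/_close

POST _snapshot/found-snapshots/cloud-snapshot-2021.03.22-UUID/_restore
{
    "indices": [
        "matschaffer-filebeat-7.7.1-2021.03.21-000095"
    ],
    "include_aliases": false
}

Per the comment above, with include_aliases: false the aliases already attached to the existing closed index are kept as-is, so the manual _aliases step from the original workaround should not be needed when restoring over an existing index.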

matschaffer commented 3 years ago

cc @elastic/cloud-orchestration for comment/prioritization

ean5533 commented 3 years ago

I don't have a strong understanding of all the implications here, but if the recommendation from ES is to just set include_aliases: false on all snapshot restores (no conditional logic) then we can do that very easily. cc @anyasabo

anyasabo commented 3 years ago

Yep, +1 here, though Dave, your wording has me a little concerned:

I believe we should always use include_aliases: false when restoring an index to recover it from some misadventure that left it in red health.

Should we just always be setting include_aliases: false?

deckkh commented 3 years ago

One additional thing that happens to us after a snapshot restore: by default it restores the ILM policy, which means ILM usually kicks in and removes the restored index shortly after the restore has completed, which is very annoying.
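One way to avoid that, sketched below with the names used earlier in this thread (untested), is to drop the index.lifecycle.name setting at restore time via ignore_index_settings so ILM no longer manages the restored copy; the trade-off is that the restored index then has to be cleaned up manually later:

POST _snapshot/found-snapshots/cloud-snapshot-2021.03.22-UUID/_restore
{
    "indices": [
        "matschaffer-filebeat-7.7.1-2021.03.21-000095"
    ],
    "include_aliases": false,
    "ignore_index_settings": [
        "index.lifecycle.name"
    ]
}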

We opened a support case on this and pretty much arrived at the conclusion that the snapshot web interface can't be used; we have since used Dev Tools for this, which is kind of sad.