elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Snapshot recovery operation sometimes logged with wrong type #91854

Closed pheyos closed 1 year ago

pheyos commented 1 year ago

Found in version

Steps to reproduce

Expected result

Actual result

Additional information

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner commented 1 year ago

Things have definitely changed in this area in 8.6.0 in ways that might make this more common, but I think this could have happened in earlier versions too. We might restore a shard onto instance-0000000001, but while that's happening we have always been free to rebalance it onto instance-0000000000 and instance-0000000002, removing the copy on instance-0000000001 and leaving only the two copies with recovery type peer in the recovery API output.
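To make the scenario concrete, here is a minimal Python sketch that counts recovery types in a response shaped like the index recovery API output. The payload is hypothetical and heavily abbreviated (the index name my-index is made up, and real responses carry many more fields per shard); it only illustrates that, after such a rebalance, no entry of type SNAPSHOT needs to remain.

```python
from collections import Counter


def recovery_types(recovery_response, index):
    """Count recovery entries per type for one index in a _recovery-style response."""
    shards = recovery_response.get(index, {}).get("shards", [])
    return Counter(shard["type"] for shard in shards)


# Hypothetical, abbreviated payload: the restored copy on instance-0000000001
# has already been removed, so only the two peer recoveries are reported.
sample = {
    "my-index": {
        "shards": [
            {"id": 0, "type": "PEER", "stage": "DONE",
             "target": {"name": "instance-0000000000"}},
            {"id": 0, "type": "PEER", "stage": "DONE",
             "target": {"name": "instance-0000000002"}},
        ]
    }
}

types = recovery_types(sample, "my-index")
print(dict(types))
print("SNAPSHOT entry present:", "SNAPSHOT" in types)
```

A check that insists on finding a SNAPSHOT entry would fail on this payload even though the restore itself succeeded.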

You cannot rely on seeing a recovery with type snapshot in these situations, although in practice it might well have been present much of the time in earlier versions. Can you help us understand why you would need to see this? If you're waiting for the recovery to complete, it should be enough to wait for the index health to be green.
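As a rough sketch of that alternative, the Python below waits for green health through the cluster health API, using its wait_for_status and timeout query parameters. The base URL, index name, and helper names are assumptions for illustration; authentication and TLS are not handled, and treating an HTTP 408 as "timed out" is an assumption about the cluster's configuration.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def health_url(base_url, index, timeout="60s"):
    """Build a cluster health URL that blocks until the index is green or the timeout expires."""
    query = urllib.parse.urlencode({"wait_for_status": "green", "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health/{urllib.parse.quote(index, safe='')}?{query}"


def is_green(health):
    """True only if the status is green and the server-side wait did not time out."""
    return health.get("status") == "green" and not health.get("timed_out", False)


def wait_for_green(base_url, index, timeout="60s"):
    """Query the health API once; requires a reachable cluster (hypothetical helper)."""
    try:
        with urllib.request.urlopen(health_url(base_url, index, timeout)) as resp:
            return is_green(json.load(resp))
    except urllib.error.HTTPError as err:
        # Assumption: a wait that times out may be reported as HTTP 408.
        if err.code == 408:
            return False
        raise
```

For example, `wait_for_green("http://localhost:9200", "my-index", "120s")` would return True once every shard copy of the hypothetical index my-index is allocated, regardless of which recovery types were involved.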

pheyos commented 1 year ago

Thanks for providing the details about how it comes to that situation @DaveCTurner!

> in practice it might well have been present much of the time in earlier versions

Indeed. We have used this approach for more than two years, with multiple runs per day, and it worked fine.

> You cannot rely on seeing a recovery with type snapshot in these situations

With your explanation, I understand how this is happening. But from a UX perspective I think it's not ideal, and it actually looks like a bug. The _recovery docs say "Use the index recovery API to get information about ongoing and completed shard recoveries." and, for the type field in the response body:

> SNAPSHOT
> A snapshot. Indicates recovery is related to a snapshot restore operation.

So a user would expect to see this type for snapshot restore operations. But with the process you described and the result of no snapshot entry left, the information that this was coming from a snapshot restore is completely lost.

> Can you help us understand why you would need to see this?

It's true that we don't necessarily need this and there are other ways to do it. It's currently implemented that way because

  1. the documentation says it works like this
  2. it worked up until recently

so there was no reason to doubt the approach.

If you're saying that this is behaving as intended and you don't plan to change that, I'd suggest updating the documentation to make it clear that a snapshot restore operation doesn't necessarily leave a snapshot recovery entry.

DaveCTurner commented 1 year ago

> If you're saying that this is behaving as intended and you don't plan to change that, I'd suggest updating the documentation to make it clear that a snapshot restore operation doesn't necessarily leave a snapshot recovery entry.

Yes, I think these docs are lacking and can be improved as you suggest. See https://github.com/elastic/elasticsearch/pull/91861.

See also https://github.com/elastic/elasticsearch/issues/60747 which would let you see older recoveries too.

pheyos commented 1 year ago

With the incoming docs update and a potential future ability to see older recoveries, I'm closing this issue. Thanks for your quick responses @DaveCTurner!