psypuff opened this issue 4 years ago
Wow. This is indeed a strange discovery. I would never have believed that a `GET _snapshot/` API call would result in an `IN_PROGRESS` snapshot state. I still find it difficult to believe this is happening. What version of Elasticsearch is it? I'd like to script up some ways to confirm this in past and present releases and see what I can discover.
ES Version: 6.8.2
Just to clarify - this is not actually resulting in an `IN_PROGRESS` state. It's a Curator-specific check, `safe_to_snap`, which among other things calls `find_snapshot_tasks`; that function checks for snapshot-related active tasks and outputs the misleading error even if it just stumbles upon a `cluster:admin/snapshot/get` task.
ES is fine; as mentioned before, we succeeded in deleting snapshots by bypassing Curator and calling the `_snapshot/` API directly. This is just an overly strict check on the Curator side that outputs the error and could be relaxed, IMO.
ES Version: 7.1.0
Same result as @psypuff. IMO, this should be relaxed a bit. This is from the documentation:
> When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted snapshot and not used by any other snapshots. If the delete snapshot operation is executed while the snapshot is being created the snapshotting process will be aborted and all files created as part of the snapshotting process will be cleaned. Therefore, the delete snapshot operation can be used to cancel long running snapshot operations that were started by mistake.
Best regards,
Same issue with ES 6.8.6 and Curator 5.8.1
Having the same issue. Has anyone discovered the cause of the constantly executing `cluster:admin/snapshot/get` tasks that cause the Curator snapshot deletes to fail?
Having the same issue as well with ES 6.8 and Curator 5.6.0. Are there any updates or known workarounds for this issue?
In my case, the source of the `cluster:admin/snapshot/get` tasks was the Prometheus exporter justwatch/elasticsearch_exporter with a short `es.timeout` (8 seconds) and a short scrape interval (10 seconds) in the ServiceMonitor. Increasing the timeout to 30 seconds fixed the exporter's behavior.
Expected Behavior
The `delete_snapshots` action shouldn't fail if there's a read-only snapshot-related task (e.g. `cluster:admin/snapshot/get`).

Actual Behavior
The `delete_snapshots` action fails with the following error:

Steps to Reproduce the Problem
Run the `delete_snapshots` action.

Specifications
Context (Environment)
We have a scheduled retention job for our ES backups; it recently started to fail on one of our clusters with the error mentioned above. According to Curator's code, this can happen if there are snapshots in progress or if there's an active snapshot-related task. In our case there are no snapshots in progress, but there are active `cluster:admin/snapshot/get` tasks all the time. We're still investigating the source of those tasks (this didn't happen before), but it shouldn't block Curator from deleting snapshots, IMO.
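One illustrative way to see where those tasks come from (the filter pattern and printed fields below are just an example) is to poll the Tasks API for snapshot actions and look at each task's description and running time:

```python
# Quick, illustrative check of which snapshot-related tasks are active right now;
# the response fields used here follow the standard Tasks API format.
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])  # assumed local cluster

resp = es.tasks.list(actions='cluster:admin/snapshot/*', detailed=True)
for node_id, node in resp.get('nodes', {}).items():
    for task_id, task in node.get('tasks', {}).items():
        print(task_id, task['action'], task.get('description', ''),
              'running for', task['running_time_in_nanos'] // 1_000_000, 'ms')
```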
Detailed Description

We're able to delete snapshots using the ES API directly, so unless there's a good reason to block this action on all snapshot-related tasks, I think it's safe to exclude read-only tasks here. I'm not sure whether only `cluster:admin/snapshot/get` tasks should be excluded or there are more potential tasks worth excluding. WDYT?
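For reference, this is roughly what the direct deletion looks like with the Python client (the repository and snapshot names are placeholders):

```python
# Delete a snapshot directly via the _snapshot API, bypassing Curator.
# 'my_backup_repo' and 'snapshot_2020.01.01' are placeholder names.
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])  # assumed local cluster
es.snapshot.delete(repository='my_backup_repo', snapshot='snapshot_2020.01.01')
```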