elastic / curator

Curator: Tending your Elasticsearch indices

delete_snapshots action fail when there's an active "cluster:admin/snapshot/get" task #1491

Open psypuff opened 4 years ago

psypuff commented 4 years ago

Expected Behavior

delete_snapshots action shouldn't fail if there's a read-only snapshot-related task (e.g. cluster:admin/snapshot/get).

Actual Behavior

delete_snapshots action fails with the following error:

Failed to complete action: delete_snapshots.  <class 'curator.exceptions.FailedExecution'>: Unable to delete snapshot(s) because a snapshot is in state "IN_PROGRESS"

Steps to Reproduce the Problem

  1. Execute one or more long-running or never-ending read-only snapshot-related requests.
  2. Execute Curator with a delete_snapshots action.

Specifications

Context (Environment)

We have a scheduled retention job for our ES backups; it recently started to fail on one of our clusters with the error mentioned above. According to Curator's code this can happen if there are snapshots in progress or if there's an active snapshot-related task. In our case there are no snapshots in progress, but there are active cluster:admin/snapshot/get tasks all the time. We're still investigating the source of those tasks (this didn't happen before), but they shouldn't block Curator from deleting snapshots IMO.

Detailed Description

We're able to delete snapshots using the ES API directly, so unless there's a good reason to block this action whenever any snapshot-related task is active, I think it's safe to exclude read-only tasks here. I'm not sure whether only cluster:admin/snapshot/get tasks should be excluded, or whether there are more potential tasks worth excluding. WDYT?

untergeek commented 4 years ago

Wow. This is indeed a strange discovery. I would never have believed that a GET _snapshot/ API call would result in an IN_PROGRESS snapshot state. I still find it difficult to believe this is happening. What version of Elasticsearch is it? I'd like to script up some ways to confirm this in past and present releases and see what I can discover.

psypuff commented 4 years ago

ES Version: 6.8.2

Just to clarify: this is not actually resulting in an IN_PROGRESS state. It's a Curator-specific check, safe_to_snap, that among other things calls find_snapshot_tasks, which looks for active snapshot-related tasks and outputs the misleading error even when it merely stumbles upon a cluster:admin/snapshot/get task.
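A relaxed version of that check could filter the task list before failing. This is a minimal sketch, not Curator's actual implementation, assuming the task list has the shape the ES _tasks API returns; the READ_ONLY_ACTIONS set is an assumption and could be extended:

```python
# Sketch of a relaxed snapshot-task check (NOT Curator's actual code).
# Read-only task actions are skipped, so an active
# cluster:admin/snapshot/get no longer blocks snapshot deletion.

# Assumption: which actions count as read-only; extend as needed.
READ_ONLY_ACTIONS = {"cluster:admin/snapshot/get"}

def find_blocking_snapshot_tasks(tasks_response):
    """Return actions of snapshot-related tasks that should block deletion.

    tasks_response is expected to look like the _tasks API output:
    {"nodes": {"<node_id>": {"tasks": {"<task_id>": {"action": ...}}}}}
    """
    blocking = []
    for node in tasks_response.get("nodes", {}).values():
        for task in node.get("tasks", {}).values():
            action = task.get("action", "")
            if "snapshot" in action and action not in READ_ONLY_ACTIONS:
                blocking.append(action)
    return blocking
```

With this shape, a response containing only cluster:admin/snapshot/get tasks would yield an empty list and deletion would proceed, while an active cluster:admin/snapshot/create would still block it.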

ES itself is fine; as mentioned before, we succeeded in deleting snapshots by bypassing Curator and calling the _snapshot/ API directly. This is just an overly strict check on Curator's side that outputs the error, and it can be relaxed IMO.
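For anyone needing the same workaround, the direct call is just an HTTP DELETE against the snapshot endpoint. A standard-library-only sketch; the host, repository, and snapshot names below are hypothetical placeholders:

```python
import urllib.request

def snapshot_url(host, repo, snapshot):
    """Build the _snapshot delete endpoint URL."""
    return f"{host}/_snapshot/{repo}/{snapshot}"

def delete_snapshot(host, repo, snapshot):
    """Issue DELETE _snapshot/<repo>/<snapshot> directly, bypassing Curator."""
    req = urllib.request.Request(snapshot_url(host, repo, snapshot),
                                 method="DELETE")
    with urllib.request.urlopen(req) as resp:  # raises HTTPError on failure
        return resp.status

# Example (hypothetical names, requires a reachable cluster):
# delete_snapshot("http://localhost:9200", "my_repo", "snapshot_2020.01.01")
```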

HarishHary commented 4 years ago

ES Version: 7.1.0

Same result as @psypuff. IMO, this should be relaxed a bit. This is from the documentation:

When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted snapshot and not used by any other snapshots. If the delete snapshot operation is executed while the snapshot is being created, the snapshotting process will be aborted and all files created as part of the snapshotting process will be cleaned. Therefore, the delete snapshot operation can be used to cancel long running snapshot operations that were started by mistake.

Best regards,

ptlittle commented 4 years ago

Same issue with ES 6.8.6 and Curator 5.8.1

stujb commented 4 years ago

Having the same issue. Has anyone discovered the cause of the constantly executing cluster:admin/snapshot/get tasks that cause the Curator snapshot deletes to fail?

ronberna commented 4 years ago

Having the same issue as well, using ES 6.8 and Curator 5.6.0. Any updates or known workarounds regarding this issue?

f84anton commented 3 years ago

In my case, the source of the cluster:admin/snapshot/get tasks was the Prometheus exporter justwatch/elasticsearch_exporter, configured with a short es.timeout (8 seconds) and a short scrape interval (10 seconds) in its ServiceMonitor. Increasing the timeout to 30 seconds fixed the exporter's behavior.