Open · idegtiarenko opened this issue 1 year ago
Pinging @elastic/es-distributed (Team:Distributed)
@idegtiarenko Can this be considered a beginner issue? If yes, then can I start working on it?
Please note this impacts deletions as well, as shown in the following stack trace (redacted); see the retry sketch after the trace:
[...][WARN ][org.elasticsearch.repositories.s3.S3BlobContainer] [instance-00000000...] Failed to delete some blobs [[snapshots/4...5/indices/_v...Q/0/___d...FA][InternalError][We encountered an internal error. Please try again.]]
com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: D...Z; S3 Extended Request ID: R...M=; Proxy: null)
at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2345) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.deletePartition(S3BlobContainer.java:381) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$doDeleteBlobs$5(S3BlobContainer.java:363) ~[?:?]
at java.util.Iterator.forEachRemaining(Iterator.java:133) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$doDeleteBlobs$6(S3BlobContainer.java:360) ~[?:?]
at org.elasticsearch.repositories.s3.SocketAccess.lambda$doPrivilegedVoid$0(SocketAccess.java:46) ~[?:?]
at java.security.AccessController.doPrivileged(AccessController.java:319) ~[?:?]
at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedVoid(SocketAccess.java:45) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.doDeleteBlobs(S3BlobContainer.java:359) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlobsIgnoringIfNotExists(S3BlobContainer.java:331) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteFromContainer(BlobStoreRepository.java:1587) ~[elasticsearch-8.8.2.jar:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$asyncCleanupUnlinkedShardLevelBlobs$17(BlobStoreRepository.java:972) ~[elasticsearch-8.8.2.jar:?]
at org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:72) ~[elasticsearch-8.8.2.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.2.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.2.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1623) ~[?:?]
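Since the SDK surfaces batch-delete failures per object (note the HTTP 200 status above, with errors reported per key), one option is to re-issue the delete for only the keys that failed. Below is a minimal sketch against the AWS SDK for Java v1 API; the `deleteWithRetry` helper and its parameters are hypothetical, not the actual `S3BlobContainer` code:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.MultiObjectDeleteException;

import java.util.List;
import java.util.stream.Collectors;

public final class PartialDeleteRetry {

    // Hypothetical helper: re-issue the batch delete for just the keys that
    // failed, up to maxAttempts times. MultiObjectDeleteException is thrown
    // even on an HTTP 200 response when individual objects fail to delete.
    static void deleteWithRetry(AmazonS3 s3, String bucket, List<String> keys, int maxAttempts) {
        List<String> remaining = keys;
        for (int attempt = 1; ; attempt++) {
            try {
                s3.deleteObjects(new DeleteObjectsRequest(bucket)
                        .withKeys(remaining.toArray(new String[0])));
                return; // all remaining keys deleted
            } catch (MultiObjectDeleteException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up, propagate the partial failure
                }
                // Keep only the keys the service reported as failed; a real
                // implementation would also back off between attempts.
                remaining = e.getErrors().stream()
                        .map(MultiObjectDeleteException.DeleteError::getKey)
                        .collect(Collectors.toList());
            }
        }
    }
}
```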
Description
Today we observe that some blobstore exceptions, such as the one below, cause a shard snapshot to fail, leading to a partial snapshot:
As suggested in the exception message, the blobstore implementation (S3 in this case) should determine whether the failure looks transient and retry the operation in order to improve overall snapshot resiliency.
This should likely be applied to other operations as well (such as deleting unreferenced blobs).
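A minimal sketch of what such transient-failure handling could look like, assuming classification by HTTP status code; the names here (`BlobStoreException`, `looksTransient`, `withRetry`) are illustrative, not the actual Elasticsearch API:

```java
import java.util.concurrent.Callable;

public final class TransientRetry {

    // Hypothetical exception type carrying the blob store's HTTP status code.
    public static class BlobStoreException extends RuntimeException {
        private final int statusCode;
        public BlobStoreException(int statusCode, String message) {
            super(message);
            this.statusCode = statusCode;
        }
        public int statusCode() { return statusCode; }
    }

    // Assumption: 5xx responses (e.g. 500 InternalError, 503 SlowDown) are
    // worth retrying, while 4xx responses are treated as permanent.
    static boolean looksTransient(BlobStoreException e) {
        return e.statusCode() >= 500 && e.statusCode() < 600;
    }

    // Retry the operation up to maxAttempts times with exponential backoff,
    // rethrowing immediately on failures that do not look transient.
    static <T> T withRetry(Callable<T> operation, int maxAttempts) throws Exception {
        long backoffMillis = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (BlobStoreException e) {
                if (attempt >= maxAttempts || looksTransient(e) == false) {
                    throw e; // permanent failure, or out of attempts
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2; // exponential backoff between attempts
            }
        }
    }
}
```

A production version would presumably also add jitter and cap the total retry budget so that snapshot operations still fail promptly during a persistent outage.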