elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

SnapshotShardService/BlobContainer should retry transient blobstore exceptions #100057

Open idegtiarenko opened 1 year ago

idegtiarenko commented 1 year ago

Description

Today we observe that some blobstore exceptions, such as the one below, cause a shard snapshot to fail, leading to a partial snapshot:

java.io.IOException: Unable to upload object [*****] using a single upload
    at org.elasticsearch.repositories.s3.S3BlobContainer.executeSingleUpload(S3BlobContainer.java:417)
    at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$writeBlob$1(S3BlobContainer.java:132)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:571)
    at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedIOException(SocketAccess.java:37)
    at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:130)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotFile(BlobStoreRepository.java:3558)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.repositories.blobstore.ShardSnapshotTaskRunner$FileSnapshotTask.lambda$doRun$0(ShardSnapshotTaskRunner.java:106)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:35)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.repositories.blobstore.ShardSnapshotTaskRunner$FileSnapshotTask.doRun(ShardSnapshotTaskRunner.java:108)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.PrioritizedThrottledTaskRunner$TaskWrapper.onResponse(PrioritizedThrottledTaskRunner.java:51)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.PrioritizedThrottledTaskRunner$TaskWrapper.onResponse(PrioritizedThrottledTaskRunner.java:27)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractThrottledTaskRunner$1.doRun(AbstractThrottledTaskRunner.java:134)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
    at org.elasticsearch.server@8.11.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 500; Error Code: InternalError; Request ID: *****; S3 Extended Request ID: *****; Proxy: null), S3 Extended Request ID: *****
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
    at com.amazonaws.services.s3.AmazonS3Client.access$300(AmazonS3Client.java:421)
    at com.amazonaws.services.s3.AmazonS3Client$PutObjectStrategy.invokeServiceCall(AmazonS3Client.java:6531)
    at com.amazonaws.services.s3.AmazonS3Client.uploadObject(AmazonS3Client.java:1861)
    at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1821)
    at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$executeSingleUpload$16(S3BlobContainer.java:415)
    at org.elasticsearch.repositories.s3.SocketAccess.lambda$doPrivilegedVoid$0(SocketAccess.java:46)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:319)
    at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedVoid(SocketAccess.java:45)
    at org.elasticsearch.repositories.s3.S3BlobContainer.executeSingleUpload(S3BlobContainer.java:415)
    ... 18 more

As the exception message suggests, the blobstore implementation (S3 in this case) should determine whether a failure looks transient and, if so, retry the operation to improve overall snapshot resiliency.

This should likely be applied to other operations as well (such as deleting unreferenced blobs).
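
A minimal sketch of the retry behaviour described above, not the actual Elasticsearch or AWS SDK code. The names `TransientRetry`, `BlobStoreStatusException`, `looksTransient` and `withRetries` are hypothetical, and treating any 5xx status (such as the `Status Code: 500; Error Code: InternalError` in the trace) as transient is an assumption:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Elasticsearch or the AWS SDK. */
final class TransientRetry {

    /** Hypothetical stand-in for a blobstore error that carries an HTTP status code. */
    static final class BlobStoreStatusException extends IOException {
        final int statusCode;

        BlobStoreStatusException(String message, int statusCode) {
            super(message);
            this.statusCode = statusCode;
        }
    }

    /** Assumption: a failure is transient if any exception in the cause chain has a 5xx status. */
    static boolean looksTransient(Throwable failure) {
        for (Throwable cause = failure; cause != null; cause = cause.getCause()) {
            if (cause instanceof BlobStoreStatusException
                && ((BlobStoreStatusException) cause).statusCode >= 500) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the operation up to maxAttempts times, doubling the wait between attempts.
     * Non-transient failures and the final failed attempt are rethrown unchanged.
     */
    static <T> T withRetries(int maxAttempts, long initialBackoffMillis, Callable<T> operation)
        throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoffMillis = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || looksTransient(e) == false) {
                    throw e;
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2;
            }
        }
    }
}
```

A caller would wrap a single blob write in `withRetries(...)`, which only makes sense if the write can be safely repeated, for example because the input stream can be reopened for each attempt.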

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-distributed (Team:Distributed)

VishalMCF commented 1 year ago

@idegtiarenko Can this be considered a beginner issue? If so, can I start working on it?

kingherc commented 1 year ago

Please note that this impacts deletions as well, as shown by the following stack trace (redacted):

[...][WARN ][org.elasticsearch.repositories.s3.S3BlobContainer] [instance-00000000...] Failed to delete some blobs [[snapshots/4...5/indices/_v...Q/0/___d...FA][InternalError][We encountered an internal error. Please try again.]]
com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: D...Z; S3 Extended Request ID: R...M=; Proxy: null)
    at com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2345) ~[?:?]
    at org.elasticsearch.repositories.s3.S3BlobContainer.deletePartition(S3BlobContainer.java:381) ~[?:?]
    at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$doDeleteBlobs$5(S3BlobContainer.java:363) ~[?:?]
    at java.util.Iterator.forEachRemaining(Iterator.java:133) ~[?:?]
    at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$doDeleteBlobs$6(S3BlobContainer.java:360) ~[?:?]
    at org.elasticsearch.repositories.s3.SocketAccess.lambda$doPrivilegedVoid$0(SocketAccess.java:46) ~[?:?]
    at java.security.AccessController.doPrivileged(AccessController.java:319) ~[?:?]
    at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedVoid(SocketAccess.java:45) ~[?:?]
    at org.elasticsearch.repositories.s3.S3BlobContainer.doDeleteBlobs(S3BlobContainer.java:359) ~[?:?]
    at org.elasticsearch.repositories.s3.S3BlobContainer.deleteBlobsIgnoringIfNotExists(S3BlobContainer.java:331) ~[?:?]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteFromContainer(BlobStoreRepository.java:1587) ~[elasticsearch-8.8.2.jar:?]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$asyncCleanupUnlinkedShardLevelBlobs$17(BlobStoreRepository.java:972) ~[elasticsearch-8.8.2.jar:?]
    at org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:72) ~[elasticsearch-8.8.2.jar:?]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.8.2.jar:?]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.8.2.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.lang.Thread.run(Thread.java:1623) ~[?:?]
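
For the deletion case, the batch request itself succeeds (`Status Code: 200`) and the failures are reported per blob, so a retry would only need to resubmit the blobs that failed with a transient error code. A rough sketch under stated assumptions: `PartialDeleteRetry`, `deleteWithRetries` and the `batchDelete` callback (which returns a map from failed key to error code) are hypothetical, the set of codes treated as transient is an assumption, and backoff is omitted for brevity:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of Elasticsearch or the AWS SDK. */
final class PartialDeleteRetry {

    /** Assumption: these S3 error codes indicate a failure that is worth retrying. */
    static final Set<String> TRANSIENT_CODES =
        new HashSet<>(Arrays.asList("InternalError", "SlowDown", "ServiceUnavailable"));

    /**
     * Deletes the keys in batches, resubmitting only keys that failed with a transient code.
     *
     * @param batchDelete hypothetical callback: deletes the given keys and returns
     *                    failed key -> error code (empty map if everything was deleted)
     * @return keys that could not be deleted, mapped to their last error code
     */
    static Map<String, String> deleteWithRetries(
        List<String> keys,
        int maxAttempts,
        Function<List<String>, Map<String, String>> batchDelete
    ) {
        Map<String, String> failures = new LinkedHashMap<>();
        List<String> pending = new ArrayList<>(keys);
        for (int attempt = 1; attempt <= maxAttempts && pending.isEmpty() == false; attempt++) {
            Map<String, String> errors = batchDelete.apply(pending);
            List<String> retry = new ArrayList<>();
            for (Map.Entry<String, String> error : errors.entrySet()) {
                boolean canRetry = attempt < maxAttempts && TRANSIENT_CODES.contains(error.getValue());
                if (canRetry) {
                    retry.add(error.getKey());
                } else {
                    // Permanent error, or a transient one that exhausted its attempts.
                    failures.put(error.getKey(), error.getValue());
                }
            }
            pending = retry;
        }
        return failures;
    }
}
```

Only the keys left in the returned map would then need to be logged as failed deletions.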