elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Recovery stage of a searchable snapshot shard stuck at FINALIZE #83878

Open dnhatn opened 2 years ago

dnhatn commented 2 years ago

Elasticsearch Version: 7.17

If there's a failure in prewarmCache, then the recovery stage of a searchable snapshot shard will remain stuck at FINALIZE even though the recovery itself has completed properly.
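
For reference, the stuck stage can be observed with the recovery APIs. A minimal sketch, with my-searchable-index standing in for the mounted index name:

GET /my-searchable-index/_recovery?human

or, for a quick overview of all shards:

GET _cat/recovery?v&h=index,shard,stage,type

Shards affected by this issue keep reporting the FINALIZE stage here even though their recovery has otherwise completed.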

This is the failure seen during prewarmCache:

[
  "[instance-0000000058] [2022.02.04-000017][28] prewarming failed for file [_4db.fdt] 
 org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution 
 at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:149) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:147) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.StepListener.innerOnFailure(StepListener.java:57) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:36) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:72) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$RunAfterActionListener.onFailure(ActionListener.java:350) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:765) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.17.0.jar:7.17.0] 
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] 
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] 
 at java.lang.Thread.run(Thread.java:833) [?:?] 
 Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:231) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:53) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:65) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 16 more 
 Caused by: java.io.IOException: Failed to prefetch file part in cache 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:216) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more 
 Caused by: java.util.concurrent.ExecutionException: org.apache.lucene.store.AlreadyClosedException: Cache file is evicted 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:244) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:206) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more 
 Caused by: org.apache.lucene.store.AlreadyClosedException: Cache file is evicted 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.throwAlreadyEvicted(CacheFile.java:325) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.ensureOpen(CacheFile.java:320) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.acquireFileChannelReference(CacheFile.java:454) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.populateAndRead(CacheFile.java:356) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:194) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more"
]
elasticmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

tlrx commented 2 years ago

We noticed that the current behaviour, where shards of fully mounted indices get stuck at the FINALIZE stage because of prewarming failures, is confusing for users.

@henningandersen and I discussed this and agreed on implementing some retry logic (maybe at the directory level) for cache file prewarming. In addition to this retry logic, we could add a FINALIZE_RETRY stage that would make it more explicit that we encountered errors during prewarming and are now retrying. We should also make sure that we are not polluting the logs with prewarming errors.

rpasche commented 2 years ago

It looks like we are now genuinely hit by this problem in our production clusters.

rpasche commented 2 years ago

Question: Could such a cache eviction be caused by the _cache/clear API call (while the restore is running)? https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html

tlrx commented 2 years ago

Question: Could such a cache eviction be caused by the _cache/clear API call (while the restore is running)? https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html

No. This API clears the index request, query and field data caches, but not the caches used by searchable snapshots.

Usually, cache file evictions occur when a shard is relocated, removed or closed during its prewarming.
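
For reference, the cache types that the clear cache API does target can be listed explicitly; a minimal sketch, with my-index as a placeholder index name:

POST /my-index/_cache/clear?query=true&fielddata=true&request=true

None of these parameters touch the searchable snapshot cache files on disk, which is why a clear cache call cannot cause the eviction seen above.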

gbanasiak commented 1 year ago

It's worth noting that there might be different scenarios leading to this problem. Here is an example with a network socket read timeout from version 7.17.7:

2023-04-12T03:14:07.236Z WARN [0] prewarming failed for file [_86.fdt]
[..]
        "stacktrace": [
          "org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution",
[..]
          "Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache",
[..]
         "Caused by: java.io.IOException: Failed to prefetch file part in cache",
[..]
          "Caused by: java.net.SocketTimeoutException: Read timed out",                 <----- HERE
          "at sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:273) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:299) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:340) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:789) ~[?:?]",
          "at java.net.Socket$SocketInputStream.read(Socket.java:1025) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:477) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:471) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]",
          "at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1465) ~[?:?]",
          "at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1069) ~[?:?]",
          "at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[?:?]",
          "at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[?:?]",
          "at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[?:?]",
          "at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:107) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at org.elasticsearch.repositories.s3.S3RetryingInputStream.read(S3RetryingInputStream.java:141) ~[?:?]",
          "at java.io.FilterInputStream.read(FilterInputStream.java:119) ~[?:?]",
          "at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:62) ~[elasticsearch-7.17.7.jar:7.17.7]",
          "at java.io.FilterInputStream.read(FilterInputStream.java:119) ~[?:?]",
[..]
nerophon commented 1 year ago

This is still happening. It caused repeated plan change failures during a routine operation on a large cluster and required 4 hours of manual labour to identify and work around. The process was to identify the stuck shards and run the following for each one:

{
  "commands": [
    {
      "cancel": {
        "index": "<index_name>",
        "shard": <shard_number>,
        "node": "<node_name>",
        "allow_primary": "<true/false>"
      }
    }
  ]
}
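
For context, this JSON is the commands body of the cluster reroute API, so the cancellation is presumably submitted as something like:

POST /_cluster/reroute
{ "commands": [ { "cancel": { "index": "<index_name>", "shard": <shard_number>, "node": "<node_name>", "allow_primary": <true/false> } } ] }

Cancelling the allocation makes the shard recover again, which is what worked around the stuck FINALIZE stage here, without addressing the underlying prewarming failure.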

Please note that this happened on an 8.9 ESS cluster. Therefore I believe this bug is either not limited to 7.17, or we have found a new bug with identical symptoms but a different root cause.