dnhatn opened this issue 2 years ago
Pinging @elastic/es-distributed (Team:Distributed)
We noticed that the current behaviour, where shards of fully mounted indices get stuck at the FINALIZE stage because of prewarming failures, is confusing for users.
@henningandersen and I discussed this and agreed on implementing some retry logic (maybe at the directory level) for cache file prewarming. In addition to this retry logic we could add a FINALIZE_RETRY stage that would make it more explicit that we encountered errors during prewarming and are now retrying. We should also make sure that we are not polluting the logs with prewarming errors.
It looks like we are now genuinely hit by this problem in our production clusters.
Question: Could such a cache eviction be caused by the _cache/clear API call (while the restore is running)? https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html
No. This API clears the index request, query and field data caches but not the caches used by searchable snapshots. Usually cache file evictions occur when a shard is relocated, removed or closed during its prewarming.
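For reference, a minimal example of that clear cache API call (the index name is a placeholder); it only targets the request, query and fielddata caches, not the searchable snapshot cache:

# "my-index" is a placeholder
POST /my-index/_cache/clear?request=true&query=true&fielddata=true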
It's worth noting that there might be different scenarios leading to this problem. Example with a network socket read timeout on version 7.17.7:
2023-04-12T03:14:07.236Z WARN [0] prewarming failed for file [_86.fdt]
[..]
"stacktrace": [
"org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution",
[..]
"Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache",
[..]
"Caused by: java.io.IOException: Failed to prefetch file part in cache",
[..]
"Caused by: java.net.SocketTimeoutException: Read timed out", <----- HERE
"at sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:273) ~[?:?]",
"at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:299) ~[?:?]",
"at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:340) ~[?:?]",
"at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:789) ~[?:?]",
"at java.net.Socket$SocketInputStream.read(Socket.java:1025) ~[?:?]",
"at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:477) ~[?:?]",
"at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:471) ~[?:?]",
"at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]",
"at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1465) ~[?:?]",
"at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1069) ~[?:?]",
"at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[?:?]",
"at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[?:?]",
"at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[?:?]",
"at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[?:?]",
"at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
"at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180) ~[?:?]",
"at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
"at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
"at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
"at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180) ~[?:?]",
"at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
"at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:107) ~[?:?]",
"at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
"at com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125) ~[?:?]",
"at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
"at org.elasticsearch.repositories.s3.S3RetryingInputStream.read(S3RetryingInputStream.java:141) ~[?:?]",
"at java.io.FilterInputStream.read(FilterInputStream.java:119) ~[?:?]",
"at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:62) ~[elasticsearch-7.17.7.jar:7.17.7]",
"at java.io.FilterInputStream.read(FilterInputStream.java:119) ~[?:?]",
[..]
This is still happening. It caused repeated plan change failures of a routine operation on a large cluster and required 4 hours of manual labour to identify and work around. The process was to identify the stuck shards and run:
{
  "commands": [
    {
      "cancel": {
        "index": "<index_name>",
        "shard": <shard_number>,
        "node": "<node_name>",
        "allow_primary": "<true/false>"
      }
    }
  ]
}
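For completeness, that body is sent to the cluster reroute API; a rough sketch of the workaround, where the index, shard and node values are placeholders:

# recoveries stuck at the finalize stage are the candidates
GET _cat/recovery?v=true&active_only=true&h=index,shard,stage,target_node

# cancel the stuck shard copy so that it gets allocated and recovered again
# ("my-index", shard 0 and "data-node-1" are placeholders)
POST /_cluster/reroute
{
  "commands": [
    { "cancel": { "index": "my-index", "shard": 0, "node": "data-node-1", "allow_primary": false } }
  ]
}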
Please note that this happened on an 8.9 ESS cluster. Therefore I believe this bug is either not limited to 7.17, or we have found a new bug with identical symptoms but a different root cause.
Elasticsearch Version: 7.17
If there's a failure in prewarmCache, then the recovery stage of a searchable snapshot shard will be stuck at FINALIZE even though its recovery has completed properly. This is the failure during prewarmCache.
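A shard in this state can be spotted via the recovery API, where the stage stays at FINALIZE; a minimal illustration, with a placeholder index name:

# "my-index" is a placeholder
GET /my-index/_recovery?active_only=true&human=true
# affected shard copies keep reporting "stage" : "FINALIZE"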