Open lnowicki10 opened 1 year ago
Pinging @elastic/es-search (Team:Search)
Maximum sequence number [105904628] from last commit does not match global checkpoint [105904627]
I think this can legitimately happen if the snapshot was taken while indexing was ongoing. We don't restore regular snapshots into a ReadOnlyEngine, so I think it's not an issue there; hence labelling this for the search team. It's possible this affects searchable snapshots too, although I think less frequently because of how ILM typically manages them.
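To illustrate the failure mode described above, here is a minimal sketch (not the Elasticsearch source) of the invariant that `ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint` enforces when the engine opens: the last commit's maximum sequence number must equal the global checkpoint, which a snapshot taken during active indexing can violate. Method and variable names here are simplified assumptions for demonstration only.

```java
// Illustrative sketch of the ReadOnlyEngine startup invariant.
// Simplified names; not the actual Elasticsearch implementation.
public class SeqNoInvariantSketch {

    // Throws if the last commit's max sequence number is ahead of the
    // global checkpoint -- the state this issue's snapshot captured.
    static void ensureMaxSeqNoEqualsGlobalCheckpoint(long maxSeqNo, long globalCheckpoint) {
        if (maxSeqNo != globalCheckpoint) {
            throw new IllegalStateException(
                "Maximum sequence number [" + maxSeqNo
                    + "] from last commit does not match global checkpoint ["
                    + globalCheckpoint + "]");
        }
    }

    public static void main(String[] args) {
        // A consistent commit passes silently.
        ensureMaxSeqNoEqualsGlobalCheckpoint(105904627L, 105904627L);

        // The state from this issue (105904628 vs 105904627) is rejected.
        try {
            ensureMaxSeqNoEqualsGlobalCheckpoint(105904628L, 105904627L);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A regular (read-write) engine can replay the translog to close this gap, which is presumably why the same snapshot restores fine on 7.x; the ReadOnlyEngine used for old (6.x-era) indices in 8.x has no such recovery path, so it fails the check instead.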
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Elasticsearch Version
8.6.1
Installed Plugins
No response
Java Version
bundled
OS Version
CentOS 7
Problem Description
We are unable to restore a large backup taken on version 6.8.13 into an 8.6.1 cluster.
The restore fails on some shards with messages like this:
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][25]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
    ... 20 more
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][25]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
    ... 18 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [117122753] from last commit does not match global checkpoint [117122751]
The same snapshot restores without a problem on a 7.x cluster. The problem occurs on random shards of large indices with lots of data (100 shards, 2 TB of data).
Steps to Reproduce
Try to restore a large dataset taken on 6.x into an 8.x cluster.
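For reference, a restore like the one described can be triggered with the standard snapshot restore API; the repository and snapshot names below are placeholders, and the index name matches the one in the logs:

```
POST /_snapshot/backup_repo/snapshot_1/_restore
{
  "indices": "users",
  "include_global_state": false
}
```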
Logs (if relevant)
[2023-01-30T15:17:10,547][WARN ][o.e.c.r.a.AllocationService] [es-master1-archive] failing shard [FailedShard[routingEntry=[users][13], node[do0HqvNyRluI84MCgDzBHA], [P], recovery_source[snapshot recovery [LVdaXt-nQ8i0D2lEy0QWAw] from backupS3_1674680402:snapshot_1674939601/6h2NrRU3RcmQgmnZizHIoA], s[INITIALIZING], a[id=2JHqNugaSfSRYkGsKHrYBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-01-30T14:17:10.140Z], failed_attempts[4], failed_nodes[[do0HqvNyRluI84MCgDzBHA]], delayed=false, last_node[do0HqvNyRluI84MCgDzBHA], details[failed shard on node [do0HqvNyRluI84MCgDzBHA]: failed recovery, failure org.elasticsearch.indices.recovery.RecoveryFailedException: [users][13]: Recovery failed on {es16-archive-2}{do0HqvNyRluI84MCgDzBHA}{6RlBvSk_QjOUpaDcpJEkhg}{es16-archive-2}{10.94.121.5}{10.94.121.5:9301}{d}{xpack.installed=true}
    at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$24(IndexShard.java:3123)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:385)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:518)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:164)
    at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:212)
    at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:397)
    at org.elasticsearch.repositories.blobstore.FileRestoreContext.lambda$restore$1(FileRestoreContext.java:166)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)
    at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:127)
    at org.elasticsearch.action.support.GroupedActionListener.onResponse(GroupedActionListener.java:55)
    at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:212)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.executeOneFileRestore(BlobStoreRepository.java:3066)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.lambda$executeOneFileRestore$1(BlobStoreRepository.java:3075)
    at org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:72)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1589)
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][13]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
    ... 20 more
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][13]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
    ... 18 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [105904628] from last commit does not match global checkpoint [105904627]
    at org.elasticsearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:184)
    at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:121)
    at org.elasticsearch.xpack.lucene.bwc.OldLuceneVersions.lambda$getEngineFactory$4(OldLuceneVersions.java:248)
    at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1949)
    at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1913)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$7(StoreRecovery.java:513)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)
    ... 15 more