elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Error restoring backup from version 6.8 to 8.6 #93389

Open lnowicki10 opened 1 year ago

lnowicki10 commented 1 year ago

Elasticsearch Version

8.6.1

Installed Plugins

No response

Java Version

bundled

OS Version

CentOS 7

Problem Description

We are unable to restore a large backup taken on version 6.8.13 onto an 8.6.1 cluster.

The restore fails on some shards with messages like:

Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][25]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
... 20 more
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][25]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
... 18 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [117122753] from last commit does not match global checkpoint [117122751]

The same snapshot restores without a problem on a 7.x cluster. The problem occurs on random shards of large indices with a lot of data (100 shards, 2 TB of data).
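For context, the restore was triggered with the standard snapshot restore API. Below is a minimal sketch of an equivalent call; the repository and snapshot names are taken from the log excerpt further down, while the host, index selection, and other settings are placeholders rather than the exact request used:

```python
import requests

ES = "http://localhost:9200"  # placeholder: the 8.6.1 target cluster

# Restore the 6.8.13 snapshot. Repository and snapshot names match the log
# excerpt below; the request body is illustrative only.
resp = requests.post(
    f"{ES}/_snapshot/backupS3_1674680402/snapshot_1674939601/_restore",
    params={"wait_for_completion": "false"},
    json={
        "indices": "users",
        "include_global_state": False,
    },
)
resp.raise_for_status()
print(resp.json())
```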

Steps to Reproduce

Try to restore a large dataset snapshotted on a 6.x cluster into an 8.x cluster.
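A quick way to see which shards might hit this is to compare max_seq_no and global_checkpoint per shard on the source cluster around the time the snapshot is taken, via the shard-level index stats API. This is a rough sketch; the host and index name are placeholders, and treating a mismatch here as the trigger for the error above is an assumption based on the exception message:

```python
import requests

ES = "http://localhost:9200"  # placeholder: the 6.8 source cluster

# Shard-level stats include a seq_no section with max_seq_no and global_checkpoint.
stats = requests.get(f"{ES}/users/_stats", params={"level": "shards"}).json()

for shard_id, copies in stats["indices"]["users"]["shards"].items():
    for copy in copies:
        seq = copy.get("seq_no", {})
        if seq.get("max_seq_no") != seq.get("global_checkpoint"):
            print(f"shard {shard_id}: max_seq_no={seq.get('max_seq_no')} "
                  f"global_checkpoint={seq.get('global_checkpoint')}")
```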

Logs (if relevant)

[2023-01-30T15:17:10,547][WARN ][o.e.c.r.a.AllocationService] [es-master1-archive] failing shard [FailedShard[routingEntry=[users][13], node[do0HqvNyRluI84MCgDzBHA], [P], recovery_source[snapshot recovery [LVdaXt-nQ8i0D2lEy0QWAw] from backupS3_1674680402:snapshot_1674939601/6h2NrRU3RcmQgmnZizHIoA], s[INITIALIZING], a[id=2JHqNugaSfSRYkGsKHrYBA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-01-30T14:17:10.140Z], failed_attempts[4], failed_nodes[[do0HqvNyRluI84MCgDzBHA]], delayed=false, last_node[do0HqvNyRluI84MCgDzBHA], details[failed shard on node [do0HqvNyRluI84MCgDzBHA]: failed recovery, failure
org.elasticsearch.indices.recovery.RecoveryFailedException: [users][13]: Recovery failed on {es16-archive-2}{do0HqvNyRluI84MCgDzBHA}{6RlBvSk_QjOUpaDcpJEkhg}{es16-archive-2}{10.94.121.5}{10.94.121.5:9301}{d}{xpack.installed=true}
    at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$24(IndexShard.java:3123)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:385)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:518)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:170)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:164)
    at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:212)
    at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:397)
    at org.elasticsearch.repositories.blobstore.FileRestoreContext.lambda$restore$1(FileRestoreContext.java:166)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)
    at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:127)
    at org.elasticsearch.action.support.GroupedActionListener.onResponse(GroupedActionListener.java:55)
    at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:212)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.executeOneFileRestore(BlobStoreRepository.java:3066)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.lambda$executeOneFileRestore$1(BlobStoreRepository.java:3075)
    at org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:72)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1589)
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][13]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
    ... 20 more
Caused by: [users/owINpknoTCmTe2kMmzuRhg][[users][13]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
    ... 18 more
Caused by: java.lang.IllegalStateException: Maximum sequence number [105904628] from last commit does not match global checkpoint [105904627]
    at org.elasticsearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:184)
    at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:121)
    at org.elasticsearch.xpack.lucene.bwc.OldLuceneVersions.lambda$getEngineFactory$4(OldLuceneVersions.java:248)
    at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1949)
    at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1913)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$7(StoreRecovery.java:513)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:162)
    ... 15 more

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

DaveCTurner commented 1 year ago

Maximum sequence number [105904628] from last commit does not match global checkpoint [105904627]

I think this can legitimately happen if the snapshot was taken while indexing was ongoing. We don't restore regular snapshots into a ReadOnlyEngine so I think it's not an issue there, hence labelling this for the search team. It's possible this affects searchable snapshots too, although I think less frequently because of how ILM typically manages them.
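To make the failure mode concrete: the check named in the stack trace, ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint, boils down to an equality assertion between the two numbers in the error message, which a commit taken during ongoing indexing can violate. A hedged, illustrative sketch of that invariant (not the actual Java implementation):

```python
# Illustrative sketch only; the real check is the Java method
# org.elasticsearch.index.engine.ReadOnlyEngine#ensureMaxSeqNoEqualsToGlobalCheckpoint.
def ensure_max_seq_no_equals_global_checkpoint(max_seq_no: int, global_checkpoint: int) -> None:
    # A snapshot taken while indexing is ongoing can commit operations above the
    # global checkpoint, so max_seq_no may legitimately be ahead of it.
    if max_seq_no != global_checkpoint:
        raise ValueError(
            f"Maximum sequence number [{max_seq_no}] from last commit "
            f"does not match global checkpoint [{global_checkpoint}]"
        )

# With the values from this report, the check fails by one operation:
# ensure_max_seq_no_equals_global_checkpoint(105904628, 105904627)
```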

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)