DaveCTurner opened 4 months ago
Pinging @elastic/es-distributed (Team:Distributed)
Reproduced for me (once) after a few thousand runs. Curious, because after we fixed https://github.com/elastic/elasticsearch/pull/105245 I ran this suite in a loop for ages without any failures, so maybe it's something we've introduced since then. I'm trying again with more logging.
Perhaps interestingly, it looks like there were indeed no failures of this test suite from early Feb (when #105245 was merged) until mid-March.
Changed to medium based on previous instances. Feel free to relabel. Also please assign yourself if you're working on it, so others know which issues are free to pick.
Managed to capture another similar failure on a branch with copious logging: testoutput-2024-05-28T04:49:28.436Z.tar.gz
java.lang.AssertionError: [index-5][0]: alahtCf9QomF-3vqOpOmWw vs 1YrxU3yWQxSHMACBh3uX4g
at __randomizedtesting.SeedInfo.seed([285E9E9DF3193DC]:0)
at org.elasticsearch.snapshots.InFlightShardSnapshotStates.assertGenerationConsistency(InFlightShardSnapshotStates.java:100)
at org.elasticsearch.snapshots.InFlightShardSnapshotStates.addStateInformation(InFlightShardSnapshotStates.java:69)
at org.elasticsearch.snapshots.InFlightShardSnapshotStates.forEntries(InFlightShardSnapshotStates.java:54)
at org.elasticsearch.cluster.SnapshotsInProgress.assertConsistentEntries(SnapshotsInProgress.java:406)
at org.elasticsearch.cluster.SnapshotsInProgress.<init>(SnapshotsInProgress.java:119)
at org.elasticsearch.cluster.SnapshotsInProgress.withUpdatedEntriesForRepo(SnapshotsInProgress.java:135)
at org.elasticsearch.snapshots.SnapshotsService.stateWithoutSnapshot(SnapshotsService.java:1864)
at org.elasticsearch.repositories.FinalizeSnapshotContext.updatedClusterState(FinalizeSnapshotContext.java:103)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$2.apply(BlobStoreRepository.java:1757)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$2.apply(BlobStoreRepository.java:1754)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$10.execute(BlobStoreRepository.java:2805)
at org.elasticsearch.cluster.service.MasterService$UnbatchedExecutor.execute(MasterService.java:579)
at org.elasticsearch.cluster.service.MasterService.innerExecuteTasks(MasterService.java:1078)
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:1043)
at org.elasticsearch.cluster.service.MasterService.executeAndPublishBatch(MasterService.java:238)
at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.lambda$run$2(MasterService.java:1684)
at org.elasticsearch.action.ActionListener.run(ActionListener.java:433)
at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.run(MasterService.java:1681)
at org.elasticsearch.cluster.service.MasterService$5.lambda$doRun$0(MasterService.java:1276)
at org.elasticsearch.action.ActionListener.run(ActionListener.java:433)
at org.elasticsearch.cluster.service.MasterService$5.doRun(MasterService.java:1255)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1570)
I wonder if the assertion is wrong? But if so, how is it not tripping more often? Still investigating...
I spent quite some time investigating this recently. I think the assertion is valid, and I have a hunch that the problem is here:
If we snapshot an index, then delete and re-create it before the snapshot is finalized, snapshot.indexByName() will return the Index that was originally snapshotted, and metadata.index() will then try to look that index up by UUID in the newer metadata and won't find it, since the delete-and-create operation produces an index with the same name but a different UUID. In that situation it looks like we expect the index shard snapshots to have failed, but in fact they could have succeeded before the index was deleted, and in that case we should adjust the generation UUID for that shard snapshot to match.
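To make the suspected sequence concrete, here is a minimal, self-contained model of the race. The Index and Metadata types below are simplified stand-ins I wrote for the real Elasticsearch classes, keeping only the name-and-UUID matching behaviour of Metadata#index(Index); none of this is the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Simplified stand-ins for org.elasticsearch.index.Index and
// org.elasticsearch.cluster.metadata.Metadata.
record Index(String name, String uuid) {}

class Metadata {
    private final Map<String, Index> indicesByName = new HashMap<>();

    void create(String name) {
        indicesByName.put(name, new Index(name, UUID.randomUUID().toString()));
    }

    void delete(String name) {
        indicesByName.remove(name);
    }

    Index getByName(String name) {
        return indicesByName.get(name);
    }

    // Mirrors the name-and-UUID matching of Metadata#index(Index):
    // the lookup only succeeds if the UUID still matches.
    Index index(Index index) {
        Index current = indicesByName.get(index.name());
        return current != null && current.uuid().equals(index.uuid()) ? current : null;
    }
}

public class DeleteRecreateRace {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();

        // 1. Create the index and start a snapshot; the snapshot records the
        //    Index (name + UUID) as it existed at snapshot start, which is
        //    what indexByName() would later return.
        metadata.create("index-5");
        Index snapshotted = metadata.getByName("index-5");

        // 2. Before the snapshot is finalized, delete and re-create the
        //    index: same name, fresh UUID.
        metadata.delete("index-5");
        metadata.create("index-5");

        // 3. Finalization resolves the snapshotted Index against the newer
        //    metadata; the UUID no longer matches, so the lookup misses even
        //    though the shard snapshot may already have succeeded.
        System.out.println(metadata.index(snapshotted)); // prints: null
    }
}
```

If the hunch is right, step 3 is where finalization concludes the index is gone and takes the "shard snapshots failed" path, even though a shard generation may already have been written, which would leave the generations inconsistent in the way the assertion reports.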
That said, I've been unable to reproduce this sequence of events in SnapshotResiliencyTests so far, but I haven't tried especially hard yet.
Build scan: https://gradle-enterprise.elastic.co/s/vgv2xu3qpq5qk/tests/:server:internalClusterTest/org.elasticsearch.snapshots.SnapshotStressTestsIT/testRandomActivities
Reproduction line:
Applicable branches: main
Reproduces locally?: Didn't try
Failure history: Failure dashboard for org.elasticsearch.snapshots.SnapshotStressTestsIT#testRandomActivities
Failure excerpt: