Open · DaveCTurner opened 1 week ago
Observed in the wild in a cluster state that took ~5s to apply. There were several GENERIC threads all performing recoveries and running here:
```
GENERIC
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.shard.IndexShard.createEngine(IndexShard.java:2146)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2119)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:2095)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:310)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.recovery.PeerRecoveryTargetService$$Lambda/0x000000c802a52458.apply(Unknown Source)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:89)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onResponse(ActionListenerImplementations.java:182)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.CompositeIndexEventListener.callListeners(CompositeIndexEventListener.java:290)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.CompositeIndexEventListener.lambda$callListeners$10(CompositeIndexEventListener.java:284)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.CompositeIndexEventListener$$Lambda/0x000000c802a75000.accept(Unknown Source)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:100)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
   java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
   java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
   java.base@21.0.2/java.lang.Thread.runWith(Thread.java:1596)
   java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)
```
Meanwhile the applier thread appears to have been trying to close one of the recovering shards and was blocked here:
```
0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'elasticsearch[es-es-search-74ff455b96-m8c5k][clusterApplierService#updateTask][T#1]'
  10/10 snapshots sharing following 21 elements
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.shard.IndexShard.close(IndexShard.java:1724)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.IndexService.lambda$closeShard$16(IndexService.java:651)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.IndexService$$Lambda/0x000000c802b8fdd0.accept(Unknown Source)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.action.ActionListener.run(ActionListener.java:433)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.IndexService.closeShard(IndexService.java:629)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.IndexService.removeShard(IndexService.java:589)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.cluster.IndicesClusterStateService.removeIndicesAndShards(IndicesClusterStateService.java:499)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.cluster.IndicesClusterStateService.doApplyClusterState(IndicesClusterStateService.java:302)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:260)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:544)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:530)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:503)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:156)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
   app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
   java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
   java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
   java.base@21.0.2/java.lang.Thread.runWith(Thread.java:1596)
   java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)
```
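For context, the shape of the problem as I read these traces: engine creation during peer recovery and `IndexShard.close` appear to contend on the same per-shard mutex, so a slow engine open on a GENERIC thread leaves the applier thread parked in `close()` for the duration. Below is a minimal standalone sketch of that pattern; everything in it (`ShardLike`, `ApplierBlockedDemo`, the 5s sleep standing in for engine creation) is hypothetical scaffolding, not the real `IndexShard` code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified model: the slow engine-opening path and close() synchronize on the
// same per-shard monitor, so whoever calls close() waits for the engine open.
class ShardLike {
    private final Object engineMutex = new Object();

    // Runs on a GENERIC thread during peer recovery; may take seconds.
    void openEngineAndSkipTranslogRecovery() throws InterruptedException {
        synchronized (engineMutex) {
            TimeUnit.SECONDS.sleep(5); // stands in for the slow engine creation
        }
    }

    // Runs on the cluster applier thread when the shard is removed.
    void close() {
        synchronized (engineMutex) { // parks here while the engine is still being opened
            // release engine, store, etc.
        }
    }
}

public class ApplierBlockedDemo {
    public static void main(String[] args) throws Exception {
        ShardLike shard = new ShardLike();
        CountDownLatch recoveryStarted = new CountDownLatch(1);

        Thread generic = new Thread(() -> {
            recoveryStarted.countDown();
            try {
                shard.openEngineAndSkipTranslogRecovery();
            } catch (InterruptedException ignored) {
            }
        }, "generic-recovery");
        generic.start();

        recoveryStarted.await();
        Thread.sleep(100); // crude way to let the recovery thread win the race for the mutex
        long start = System.nanoTime();
        shard.close(); // "applier thread": stalls until the recovery thread releases the mutex
        System.out.printf("close() blocked for ~%d ms%n",
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
        generic.join();
    }
}
```

Running the demo prints a `close()` block time roughly equal to the simulated engine open, which matches the ~5s applier stall observed above.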
Relates #89821
Pinging @elastic/es-distributed (Team:Distributed)