elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

`IndicesClusterStateService` blocks applier thread acquiring `IndexShard#engineMutex` #108817

Open DaveCTurner opened 1 week ago

DaveCTurner commented 1 week ago

Observed in the wild during a cluster state application that took ~5s. There were several GENERIC threads, all performing recoveries, running here:

    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.shard.IndexShard.createEngine(IndexShard.java:2146)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2119)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:2095)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:310)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.recovery.PeerRecoveryTargetService$$Lambda/0x000000c802a52458.apply(Unknown Source)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:89)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onResponse(ActionListenerImplementations.java:182)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.CompositeIndexEventListener.callListeners(CompositeIndexEventListener.java:290)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.CompositeIndexEventListener.lambda$callListeners$10(CompositeIndexEventListener.java:284)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.CompositeIndexEventListener$$Lambda/0x000000c802a75000.accept(Unknown Source)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:100)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    java.base@21.0.2/java.lang.Thread.runWith(Thread.java:1596)
    java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)

Meanwhile, the applier thread appears to have been trying to close one of the recovering shards and was blocked here:

 0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'elasticsearch[es-es-search-74ff455b96-m8c5k][clusterApplierService#updateTask][T#1]'
  10/10 snapshots sharing following 21 elements
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.shard.IndexShard.close(IndexShard.java:1724)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.IndexService.lambda$closeShard$16(IndexService.java:651)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.IndexService$$Lambda/0x000000c802b8fdd0.accept(Unknown Source)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.action.ActionListener.run(ActionListener.java:433)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.IndexService.closeShard(IndexService.java:629)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.index.IndexService.removeShard(IndexService.java:589)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.cluster.IndicesClusterStateService.removeIndicesAndShards(IndicesClusterStateService.java:499)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.cluster.IndicesClusterStateService.doApplyClusterState(IndicesClusterStateService.java:302)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:260)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:544)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:530)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:503)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:156)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
    app/org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
    java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    java.base@21.0.2/java.lang.Thread.runWith(Thread.java:1596)
    java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)

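For illustration only, here is a minimal, self-contained sketch of the contention pattern (hypothetical class and method names, not the actual `IndexShard` code): one thread holds an `engineMutex`-style monitor across a slow engine open, so a concurrent `close()` from the applier thread stalls on the same monitor for the duration.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Hypothetical stand-in for the pattern in the traces above: a recovery thread
    // holds the shard's engine mutex while opening the engine, so a concurrent
    // close() from the cluster applier thread blocks on the same monitor.
    public class EngineMutexContentionSketch {

        private final Object engineMutex = new Object(); // stand-in for IndexShard#engineMutex

        // Simulates the recovery path (openEngineAndSkipTranslogRecovery -> createEngine):
        // the engine is opened while the mutex is held, and opening can be slow.
        void openEngineDuringRecovery() throws InterruptedException {
            synchronized (engineMutex) {
                TimeUnit.SECONDS.sleep(5); // slow engine creation (e.g. I/O) under the lock
            }
        }

        // Simulates the applier path (removeIndicesAndShards -> closeShard -> IndexShard.close):
        // the applier thread must take the same mutex before it can close the shard.
        void closeShardFromApplierThread() {
            long start = System.nanoTime();
            synchronized (engineMutex) {
                // close the engine here
            }
            System.out.printf("close() waited %d ms for engineMutex%n",
                    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
        }

        public static void main(String[] args) throws InterruptedException {
            EngineMutexContentionSketch shard = new EngineMutexContentionSketch();
            CountDownLatch recoveryStarted = new CountDownLatch(1);

            Thread generic = new Thread(() -> {
                recoveryStarted.countDown();
                try {
                    shard.openEngineDuringRecovery();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "generic-recovery");

            Thread applier = new Thread(shard::closeShardFromApplierThread, "clusterApplier");

            generic.start();
            recoveryStarted.await();
            Thread.sleep(100);  // ensure the recovery thread already holds the mutex
            applier.start();    // this thread now stalls for ~5s, delaying cluster state application

            generic.join();
            applier.join();
        }
    }

In the sketch, the "applier" thread reports a wait of roughly 5s, mirroring the ~5s cluster state application seen above; the real fix presumably involves not blocking the applier thread on work done under `IndexShard#engineMutex`.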
Relates #89821

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-distributed (Team:Distributed)