A user reported to me that they had inadvertently run a very expensive collection of queries which put their cluster under stress, so they cancelled them. However, some indices:data/read/search[phase/query] tasks continued to run for a very long time after being cancelled, and eventually they had to restart nodes to restore the cluster to a working state. They shared a thread dump which shows various places where we appear to be missing cancellation checks today, most commonly in stack traces that look like this one:
100.0% [cpu=99.9%, other=0.1%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[REDACTED][search_worker][T#6]'
10/10 snapshots sharing following 35 elements
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrds$FromMany$1.next(LongKeyedBucketOrds.java:368)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$RemapGlobalOrds.forEach(GlobalOrdinalsStringTermsAggregator.java:560)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$ResultStrategy.buildAggregations(GlobalOrdinalsStringTermsAggregator.java:606)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.buildAggregations(GlobalOrdinalsStringTermsAggregator.java:185)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BestBucketsDeferringCollector$2.buildAggregations(BestBucketsDeferringCollector.java:245)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForBuckets(BucketsAggregator.java:180)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForAllBuckets(BucketsAggregator.java:242)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.access$300(GlobalOrdinalsStringTermsAggregator.java:55)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$StandardTermsResults.buildSubAggs(GlobalOrdinalsStringTermsAggregator.java:766)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$StandardTermsResults.buildSubAggs(GlobalOrdinalsStringTermsAggregator.java:715)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$ResultStrategy.buildAggregations(GlobalOrdinalsStringTermsAggregator.java:630)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.buildAggregations(GlobalOrdinalsStringTermsAggregator.java:185)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BestBucketsDeferringCollector$2.buildAggregations(BestBucketsDeferringCollector.java:245)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForBuckets(BucketsAggregator.java:180)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForAllBuckets(BucketsAggregator.java:242)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator.access$100(MapStringTermsAggregator.java:50)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator$StandardTermsResults.buildSubAggs(MapStringTermsAggregator.java:439)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator$StandardTermsResults.buildSubAggs(MapStringTermsAggregator.java:357)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator$ResultStrategy.buildAggregations(MapStringTermsAggregator.java:276)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator.buildAggregations(MapStringTermsAggregator.java:112)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.Aggregator.buildTopLevel(Aggregator.java:159)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.AggregatorCollector.doPostCollection(AggregatorCollector.java:47)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.query.QueryPhaseCollector.doPostCollection(QueryPhaseCollector.java:379)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.internal.ContextIndexSearcher.doAggregationPostCollection(ContextIndexSearcher.java:486)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:475)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.internal.ContextIndexSearcher.lambda$search$4(ContextIndexSearcher.java:375)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.internal.ContextIndexSearcher$$Lambda/0x00007f8aa1962038.call(Unknown Source)
java.base@21.0.1/java.util.concurrent.FutureTask.run(FutureTask.java:317)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
app/org.elasticsearch.server@8.11.1/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
java.base@21.0.1/java.lang.Thread.runWith(Thread.java:1596)
java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)
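The trace above is stuck in a tight iteration over bucket ordinals (LongKeyedBucketOrds$FromMany$1.next) during post-collection aggregation building, with no cancellation check between iterations, so cancelling the task has no effect once this phase starts. A minimal sketch of the cooperative-cancellation pattern such loops need — the names here (sum, CHECK_INTERVAL, the use of a plain AtomicBoolean and RuntimeException) are illustrative assumptions, not Elasticsearch's actual task-cancellation API:

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a long-running loop that polls a shared cancellation
// flag every CHECK_INTERVAL iterations, instead of only before it starts.
public class CancellableLoop {
    // Power of two so the check reduces to a cheap bitmask test.
    static final int CHECK_INTERVAL = 1 << 12; // poll every 4096 iterations

    // Sums `values`, but bails out promptly once `cancelled` is set.
    static long sum(long[] values, AtomicBoolean cancelled) {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            if ((i & (CHECK_INTERVAL - 1)) == 0 && cancelled.get()) {
                throw new RuntimeException("task cancelled");
            }
            total += values[i];
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = new long[100_000];
        Arrays.fill(values, 1L);
        AtomicBoolean cancelled = new AtomicBoolean(false);
        System.out.println(sum(values, cancelled)); // completes normally

        // Once the flag is set, the next poll aborts the loop.
        cancelled.set(true);
        try {
            sum(values, cancelled);
        } catch (RuntimeException e) {
            System.out.println("cancelled promptly");
        }
    }
}
```

Polling only every few thousand iterations keeps the overhead of the check negligible while still bounding how long a cancelled task can keep burning CPU in this kind of loop.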