elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Aggregation keeps running long after task cancellation #108701

Open DaveCTurner opened 4 months ago

DaveCTurner commented 4 months ago

A user reported to me that they had inadvertently run a very expensive collection of queries that put their cluster under stress, so they cancelled them. However, some indices:data/read/search[phase/query] tasks continued to run for a very long time after being cancelled, and eventually they had to restart nodes to restore the cluster to a working state. They shared a thread dump which shows various places where we appear to be missing cancellation detection today, most commonly in stack traces that look like this one:

   100.0% [cpu=99.9%, other=0.1%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[REDACTED][search_worker][T#6]'
     10/10 snapshots sharing following 35 elements
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrds$FromMany$1.next(LongKeyedBucketOrds.java:368)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$RemapGlobalOrds.forEach(GlobalOrdinalsStringTermsAggregator.java:560)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$ResultStrategy.buildAggregations(GlobalOrdinalsStringTermsAggregator.java:606)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.buildAggregations(GlobalOrdinalsStringTermsAggregator.java:185)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BestBucketsDeferringCollector$2.buildAggregations(BestBucketsDeferringCollector.java:245)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForBuckets(BucketsAggregator.java:180)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForAllBuckets(BucketsAggregator.java:242)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.access$300(GlobalOrdinalsStringTermsAggregator.java:55)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$StandardTermsResults.buildSubAggs(GlobalOrdinalsStringTermsAggregator.java:766)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$StandardTermsResults.buildSubAggs(GlobalOrdinalsStringTermsAggregator.java:715)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator$ResultStrategy.buildAggregations(GlobalOrdinalsStringTermsAggregator.java:630)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.buildAggregations(GlobalOrdinalsStringTermsAggregator.java:185)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BestBucketsDeferringCollector$2.buildAggregations(BestBucketsDeferringCollector.java:245)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForBuckets(BucketsAggregator.java:180)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.BucketsAggregator.buildSubAggsForAllBuckets(BucketsAggregator.java:242)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator.access$100(MapStringTermsAggregator.java:50)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator$StandardTermsResults.buildSubAggs(MapStringTermsAggregator.java:439)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator$StandardTermsResults.buildSubAggs(MapStringTermsAggregator.java:357)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator$ResultStrategy.buildAggregations(MapStringTermsAggregator.java:276)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator.buildAggregations(MapStringTermsAggregator.java:112)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.Aggregator.buildTopLevel(Aggregator.java:159)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.aggregations.AggregatorCollector.doPostCollection(AggregatorCollector.java:47)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.query.QueryPhaseCollector.doPostCollection(QueryPhaseCollector.java:379)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.internal.ContextIndexSearcher.doAggregationPostCollection(ContextIndexSearcher.java:486)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:475)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.internal.ContextIndexSearcher.lambda$search$4(ContextIndexSearcher.java:375)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.search.internal.ContextIndexSearcher$$Lambda/0x00007f8aa1962038.call(Unknown Source)
       java.base@21.0.1/java.util.concurrent.FutureTask.run(FutureTask.java:317)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
       app/org.elasticsearch.server@8.11.1/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
       java.base@21.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@21.0.1/java.lang.Thread.runWith(Thread.java:1596)
       java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)

elasticsearchmachine commented 4 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 commented 4 months ago

Looks like they are building a huge aggregation. We could certainly check for interruption periodically in there.
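
For illustration only, here is a minimal sketch of what a periodic cancellation check inside a long bucket-building loop might look like. It is not Elasticsearch code: the class CancellableBucketLoop, the method forEachBucket, the CHECK_INTERVAL constant, and the use of a BooleanSupplier as the cancellation signal are all hypothetical names chosen for this example. The idea is to poll the cancellation flag only once every N iterations so the check stays cheap relative to the work in the loop.

```java
import java.util.concurrent.CancellationException;
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

/** Hypothetical sketch of a cancellation-aware loop; not taken from Elasticsearch. */
public class CancellableBucketLoop {
    // Poll the cancellation flag once per this many iterations.
    // A power of two lets the test below use a cheap bit mask.
    private static final int CHECK_INTERVAL = 1024;

    /**
     * Visits bucket ordinals 0..bucketCount-1 and polls isCancelled every
     * CHECK_INTERVAL iterations. Throws CancellationException if the flag is
     * set; otherwise returns the number of buckets visited.
     */
    public static long forEachBucket(long bucketCount, BooleanSupplier isCancelled, LongConsumer visitor) {
        for (long ord = 0; ord < bucketCount; ord++) {
            if ((ord & (CHECK_INTERVAL - 1)) == 0 && isCancelled.getAsBoolean()) {
                throw new CancellationException("task cancelled after visiting " + ord + " buckets");
            }
            visitor.accept(ord);
        }
        return bucketCount;
    }

    public static void main(String[] args) {
        long[] visited = {0};
        try {
            // Simulate a cancellation flag that flips once 5000 buckets have been processed.
            forEachBucket(1_000_000, () -> visited[0] >= 5_000, ord -> visited[0]++);
        } catch (CancellationException e) {
            // The flag is only noticed at the next multiple of CHECK_INTERVAL (5120 here).
            System.out.println("stopped early: " + e.getMessage());
        }
    }
}
```

The trade-off in a scheme like this is between responsiveness and overhead: a smaller interval notices cancellation sooner but polls more often, while a larger one lets more work run after the task has been cancelled.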