Internal: Stuck on java.util.HashMap.get?

maf23 commented 10 years ago

We seem to have a problem with stuck threads in an Elasticsearch cluster. It appears at random, but once a thread is stuck it seems to keep being stuck until elasticsearch on that node is restarted. The theads get stuck in a busy loop and the stack trace of one is:

Thread 3744: (state = IN_JAVA)
 - java.util.HashMap.getEntry(java.lang.Object) @bci=72, line=446 (Compiled frame; information may be imprecise)
 - java.util.HashMap.get(java.lang.Object) @bci=11, line=405 (Compiled frame)
 - org.elasticsearch.search.scan.ScanContext$ScanFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=8, line=156 (Compiled frame)
 - org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=6, line=45 (Compiled frame)
 - org.apache.lucene.search.FilteredQuery$1.scorer(org.apache.lucene.index.AtomicReaderContext, boolean, boolean, org.apache.lucene.util.Bits) @bci=34, line=130 (Compiled frame)
 - org.apache.lucene.search.IndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=68, line=618 (Compiled frame)
 - org.elasticsearch.search.internal.ContextIndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=225, line=173 (Compiled frame)
 - org.apache.lucene.search.IndexSearcher.search(org.apache.lucene.search.Query, org.apache.lucene.search.Collector) @bci=11, line=309 (Interpreted frame)
 - org.elasticsearch.search.scan.ScanContext.execute(org.elasticsearch.search.internal.SearchContext) @bci=54, line=52 (Interpreted frame)
 - org.elasticsearch.search.query.QueryPhase.execute(org.elasticsearch.search.internal.SearchContext) @bci=174, line=119 (Compiled frame)
 - org.elasticsearch.search.SearchService.executeScan(org.elasticsearch.search.internal.InternalScrollSearchRequest) @bci=49, line=233 (Interpreted frame)
 - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.search.internal.InternalScrollSearchRequest, org.elasticsearch.transport.TransportChannel) @bci=8, line=791 (Interpreted frame)
 - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.transport.TransportRequest, org.elasticsearch.transport.TransportChannel) @bci=6, line=780 (Interpreted frame)
 - org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run() @bci=12, line=270 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=724 (Interpreted frame)

It looks very much as the known problem of using the non-synchronized HashMap class in a threaded environment, see (http://stackoverflow.com/questions/17070184/hashmap-stuck-on-get). Unfortunately I'm not familiar enough with the es code to know if this can be the issue.

The solution mentioned at the link is to use ConcurrentHashMap instead.

s1monw commented 10 years ago

this does look like the bug you are referring to. thanks for reporting this!

maf23 commented 10 years ago

An additional note, we have noted that this seems to happen (at least more frequently) when we post multiple parallel scan queries. Which seems to make sense from what I can see in the stack trace.