Closed maf23 closed 10 years ago
This does look like the bug you are referring to. Thanks for reporting this!
An additional note: we have observed that this seems to happen (or at least happens more frequently) when we post multiple parallel scan queries, which seems to make sense from what I can see in the stack trace.
@maf23 Were you using the same scroll id multiple times in the parallel scan queries?
The scroll id should be different. But we will check our code to make sure this is actually the case.
On Wed, Aug 27, 2014 at 2:30 PM, Martijn van Groningen <notifications@github.com> wrote:
> @maf23 Were you using the same scroll id multiple times in the parallel scan queries?
I can see how this situation can occur if multiple scroll requests are scrolling in parallel with the same scroll id (or the same scroll id prefix). The scroll API was never designed to support this. I think we need proper validation for the case where two search requests try to access the same scan context that is open on a node.
Also, running the clear scroll API during a scroll session can cause this bug.
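To illustrate the kind of validation suggested above, here is a minimal sketch, not the actual Elasticsearch code. `GuardedScanContext`, `tryAcquire`, and `release` are hypothetical names invented for this example. It assumes one flag per open scan context, which a request must claim before touching the context's state.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper around a scan context; not an Elasticsearch class.
public class GuardedScanContext {
    // True while some request is actively using this context.
    private final AtomicBoolean inUse = new AtomicBoolean(false);

    /** Returns true if the caller now owns the context, false if another request holds it. */
    public boolean tryAcquire() {
        return inUse.compareAndSet(false, true);
    }

    /** Must be called by the owning request when it has finished with the context. */
    public void release() {
        inUse.set(false);
    }
}
```

A second request arriving with the same scroll id would see `tryAcquire()` return `false` and could be rejected with an error instead of concurrently modifying shared state.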
@maf23 Can you share which JVM version and vendor you're using?
Sure, Oracle JVM 1.7.0_25
OK, thanks. As you mentioned, a ConcurrentHashMap should be used here, since the map in question is accessed by different threads during the entire scroll.
@maf23 Pushed a fix for this bug, which will be included in the next release. Thanks for reporting this!
We seem to have a problem with stuck threads in an Elasticsearch cluster. It appears at random, but once a thread is stuck it seems to stay stuck until Elasticsearch on that node is restarted. The threads get stuck in a busy loop, and the stack trace of one is:
It looks very much like the known problem of using the non-synchronized HashMap class in a multithreaded environment (see http://stackoverflow.com/questions/17070184/hashmap-stuck-on-get). Unfortunately, I'm not familiar enough with the Elasticsearch code to know if this can be the issue.
The solution mentioned at the link is to use ConcurrentHashMap instead.
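For illustration, a minimal sketch of that approach follows. It is not taken from the Elasticsearch code base: `SharedMapSketch`, `readerStates`, and `runParallel` are hypothetical names, and the example assumes Java 8 or later for the lambda syntax. Several threads write to and read from one shared map. With a plain `HashMap` this access pattern is unsafe and can corrupt the table during a resize. `ConcurrentHashMap` is designed for it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedMapSketch {
    // Hypothetical stand-in for a map shared by several search threads.
    private final Map<Integer, Integer> readerStates = new ConcurrentHashMap<>();

    /** Writes and reads the shared map from several threads; returns the final size. */
    public int runParallel(int threads, int entriesPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int offset = t * entriesPerThread; // disjoint key range per thread
            pool.submit(() -> {
                for (int i = 0; i < entriesPerThread; i++) {
                    readerStates.put(offset + i, i);
                    readerStates.get(offset + i);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return readerStates.size();
    }

    public static void main(String[] args) {
        System.out.println(new SharedMapSketch().runParallel(4, 10_000));
    }
}
```

Swapping the field's initializer for `new HashMap<>()` turns this into the unsafe pattern described in the linked question, so that variant should not be run anywhere that matters.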