elastic / elasticsearch-java

Official Elasticsearch Java Client
Apache License 2.0
6 stars 242 forks source link

Deadlock with virtual threads #849

Closed sakuuj closed 2 months ago

sakuuj commented 3 months ago

Java API client version

8.13.4

Java version

21

Elasticsearch Version

8.14.3

Problem description

I have been testing performance of ElasticsearchOperations (Spring Data Elasticsearch, uses org.elasticsearch.client.RestClient underneath), by calling my controller endpoints and noticed a deadlock when using virtual threads and a custom IOReactorConfig with IoThreadCount set to a low number (1 in the current example). I used hatoo/oha and set a number of concurrent requests and a number of total requests to a 100. And there was a deadlock. As I found out after checking the thread dump from jcmd, pinned virtual threads were in a synchronized block:

#773 "tomcat-handler-715" virtual
      java.base/java.lang.Object.wait0(Native Method)
      java.base/java.lang.Object.wait(Object.java:366)
      java.base/java.lang.Object.wait(Object.java:339)
      org.apache.http.concurrent.BasicFuture.get(BasicFuture.java:82)
      org.apache.http.impl.nio.client.FutureWrapper.get(FutureWrapper.java:70)
      org.elasticsearch.client.RestClient.performRequest(RestClient.java:300)
      ...

and some of the unmounted virtual threads, were waiting for the lock in the AbstractNIOConnPool:

#779 "tomcat-handler-721" virtual
      java.base/java.lang.VirtualThread.park(VirtualThread.java:596)
      java.base/java.lang.System$2.parkVirtualThread(System.java:2643)
      java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
      java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
      java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
      java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
      org.apache.http.nio.pool.AbstractNIOConnPool.lease(AbstractNIOConnPool.java:278)
      ...

And the elasticsearch thread that should notify threads waiting in a synchronized block, was also blocked by the ReentrantLock in the AbstractNIOConnPool:

#36 "elasticsearch-rest-client-0-thread-2"
      java.base/jdk.internal.misc.Unsafe.park(Native Method)
      java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
      java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
      java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
      org.apache.http.nio.pool.AbstractNIOConnPool.release(AbstractNIOConnPool.java:344)
     ...

Maybe it happened because:

  1. virtual threads that got in the AbstractNIOConnPool.lease where unmounted due to some I/O operations inside the method, but still holding the AbstractNIOConnPool ReentrantLock
  2. other virtual threads were mounted instead and got blocked by calling Object.wait().
  3. no more platform threads were left
  4. elasticsearch-rest-client-0-thread-2 could never acquire the needed lock again and was blocked forever.

Or maybe that is wrong and there were some other interleavings. But the fact is that a deadlock was detected and it would be nice to change current RestClient implementation.

The tested program configurations:

Hardware: a processor with 8 physical cores.

How to fix the problem, but lose throughput? The problem could be fixed with wrapping all the methods of RestClient with ReentrantLock.lock, ReentrantLock.unlock statements (for example, using Spring AOP). At least there were no deadlocks detected in that case, when manually testing with hatoo/oha with 10_000 of concurrent requests. But such a solution degrades the overall throughput.

l-trotta commented 2 months ago

Hello! As you already figured out, this is a problem with the underlying RestClient used by the java client. Unfortunately, updating it probably won't be enough to fix this issue: many users have been reporting problems when using virtual threads and locks/synchronized blocks, this thread being one of many examples. This is likely going to be fixed in Java 23, as explained in this JEP draft, until then we have no way of ensuring compatibility with virtual threads.

sakuuj commented 2 months ago

Hello! Thank you for your informative response.