apa512 / clj-rethinkdb


Communication with RethinkDB freezing/hanging completely #198

Closed: jwr closed this issue 5 years ago

jwr commented 5 years ago

I am having a serious problem with an application freezing/hanging completely at non-deterministic times. No exceptions or errors are logged, and there is no discernible reason.

I finally managed to get a thread dump using jstack -F, which appears to indicate that I have lots of threads that are blocked waiting for a reply from RethinkDB. Here is an example:

Thread 18006: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=836 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) @bci=72, line=997 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) @bci=24, line=1304 (Compiled frame)
 - java.util.concurrent.CountDownLatch.await() @bci=5, line=231 (Compiled frame)
 - manifold.deferred.Deferred.deref() @bci=192, line=439 (Compiled frame)
 - clojure.core$deref.invokeStatic(java.lang.Object) @bci=13, line=2320 (Compiled frame)
 - clojure.core$deref.invoke(java.lang.Object) @bci=3, line=2306 (Compiled frame)
 - rethinkdb.net$send_initial_query.invokeStatic(java.lang.Object, java.lang.Object) @bci=461, line=185 (Compiled frame)
 - rethinkdb.net$send_initial_query.invoke(java.lang.Object, java.lang.Object) @bci=6, line=173 (Compiled frame)
 - rethinkdb.net$send_start_query.invokeStatic(java.lang.Object, java.lang.Object, clojure.lang.ISeq) @bci=111, line=201 (Compiled frame)
 - rethinkdb.net$send_start_query.doInvoke(java.lang.Object, java.lang.Object, java.lang.Object) @bci=12, line=199 (Compiled frame)
 - clojure.lang.RestFn.invoke(java.lang.Object, java.lang.Object, java.lang.Object) @bci=149, line=442 (Compiled frame)
 - rethinkdb.query$run.invokeStatic(java.lang.Object, java.lang.Object, clojure.lang.ISeq) @bci=51, line=1080 (Compiled frame)
 - rethinkdb.query$run.doInvoke(java.lang.Object, java.lang.Object, java.lang.Object) @bci=12, line=1079 (Compiled frame)

I was hoping someone more familiar with the network code could take a look and tell me if this looks like RethinkDB's fault or like a potential problem in clj-rethinkdb. Is it something we could protect against (by disconnecting and reconnecting, for example)?

I know that RethinkDB isn't completely stuck, because I can connect using the admin interface, issue queries, and get replies.
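
For what it's worth, the kind of protection I had in mind is something along these lines. This is only a sketch, run-with-timeout is a made-up helper (not part of clj-rethinkdb), and the table name is invented:

(require '[rethinkdb.query :as r])

;; Hypothetical helper: run the query on its own thread and stop waiting after
;; timeout-ms instead of blocking the caller forever. Note that a timed-out
;; query still occupies its thread until RethinkDB eventually replies.
(defn run-with-timeout [query conn timeout-ms]
  (let [result (future (r/run query conn))]
    (deref result timeout-ms ::timeout)))

;; If ::timeout comes back, closing the connection and reconnecting might be
;; one way to recover (assuming the connection object is Closeable):
;; (when (= ::timeout (run-with-timeout (r/table "users") conn 5000))
;;   (.close conn))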

danielcompton commented 5 years ago

I haven’t experienced what you’re describing. What’s the network connection between the two servers? Which version of the library are you using? Are the queries or the results that are hanging unusual or overly large?

If it were me, I think my next step would be to increase the logging around connection handling and the parsing of query responses.

jwr commented 5 years ago

Both the app and RethinkDB run on a single host. RethinkDB has a (non-voting) replica elsewhere, but the app does not access that. Until recently, I was able to run the same setup for months without problems.

I am using 0.15.26 and I could not identify any specific change that triggered the problem. The application might hang after running for days, hours, or right after restarting. The only potentially relevant change that I could find is the number of changefeeds: it usually hovered around 200, but recently grew to about 300. And restarting the application causes (most) changefeeds to be re-established.

Unfortunately, since this is non-reproducible and non-deterministic, all I have to go on is the jstack dump. There is nothing in the RethinkDB logs, nothing in the application logs, no exceptions, no errors, no visible failures. To make sure, I implemented a default global uncaught exception handler, but it doesn't get called. And RethinkDB seems to respond in the admin interface, too.

This kind of bug is, unfortunately, catastrophic. I don't know if that's a bug in RethinkDB or clj-rethinkdb, but this might finally force me to move to FoundationDB.

danielcompton commented 5 years ago

I don't have any good suggestions on fixes, sorry. I'd maybe recommend forking this library and adding logging statements throughout the networking/protocol code to see if you can narrow down where it's getting stuck.
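
Even without forking, wrapping the call sites might give a rough picture of which query hangs. A minimal sketch (log-run is a made-up helper, assuming clojure.tools.logging is on the classpath):

(require '[clojure.tools.logging :as log]
         '[rethinkdb.query :as r])

;; Hypothetical wrapper: log before and after every query, so the last
;; "started" entry without a matching "finished" entry points at the hang.
(defn log-run [label query conn]
  (log/info "query started:" label)
  (let [start  (System/nanoTime)
        result (r/run query conn)]
    (log/info "query finished:" label
              (str (quot (- (System/nanoTime) start) 1000000) "ms"))
    result))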

jwr commented 5 years ago

Yes, these kinds of issues are the worst to debug. In the meantime I've come to think that this might be a RethinkDB problem, because if I restart the app after a hang, I usually get a similar hang shortly after it starts up. Rebooting the entire machine or restarting RethinkDB and then the application does not seem to cause another hang.

The BLOCKED state of the threads might indicate a locking issue with manifold, too, but it would need to be something triggered by RethinkDB itself.

I don't think debugging the whole stack is realistic for me.

jwr commented 5 years ago

I found the root cause of this problem and thought I'd leave a post-mortem; perhaps it will help others avoid falling into a similar trap.

As it turned out, the problem was a deadlock caused by insufficient threads in the core.async thread pool. A careful audit of the code showed that in many places functions called from go blocks eventually performed blocking I/O, which should be avoided at all costs. go blocks are intended for short-running code only and should delegate all extensive work to separate threads.
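
To illustrate the trap with a simplified sketch (this is not the actual application code; the table and channel names are made up):

(require '[clojure.core.async :as async :refer [go <!]]
         '[rethinkdb.query :as r])

;; The anti-pattern: blocking I/O (here a RethinkDB query) inside a go block.
;; go blocks run on a small fixed pool (8 threads by default), so enough of
;; these in flight at the same time exhaust the pool and everything stalls.
(defn handle-event-bad [conn events]
  (go
    (let [ev (<! events)]
      (-> (r/table "events") (r/insert ev) (r/run conn)))))

;; The fix: push the blocking work onto its own thread (async/thread returns a
;; channel carrying the result) and only park on channel operations in the go block.
(defn handle-event-good [conn events]
  (go
    (let [ev (<! events)]
      (<! (async/thread
            (-> (r/table "events") (r/insert ev) (r/run conn)))))))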

I am unclear on why the problem manifested itself in this particular way, since I thought manifold does not use the same thread pool as core.async. But perhaps the manifold deferreds were waiting for something that core.async code was supposed to deliver, and that's why the stack traces pointed to manifold.

An immediate emergency remedy is to increase the size of the core.async thread pool by adding a -Dclojure.core.async.pool-size=N parameter to the command line. The correct fix is to audit the code, find all places where blocking operations are initiated from go blocks, and eliminate those. As it turns out, these aren't always immediately obvious: for example, some code several calls down the stack might need to get configuration data and thus perform blocking I/O.
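
For reference, the stopgap looks roughly like this (assuming a Leiningen project; the value 64 is just an example, and raising the pool only postpones the deadlock rather than fixing it):

;; project.clj: raise the core.async pool from its default of 8 threads
:jvm-opts ["-Dclojure.core.async.pool-size=64"]

;; or directly on the JVM command line:
;; java -Dclojure.core.async.pool-size=64 -jar app.jar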

TL;DR takeaway: do not do any blocking or CPU-bound work in go blocks.