Deadlock when using manifold.time scheduling in application code

gsnewmark commented 6 years ago

Right now default manifold.time scheduler is based on single-thread executor which is used both internally in Manifold itself (e. g., for deferred timeouts) and in public scheduling API (namely manifold.time/in & manifold.time/every). This can lead to nasty deadlock bugs when seemingly unconnected parts of application stop working due to blocking in scheduler. For example, following code will never finish because inner timeout will wait for scheduler thread blocked by the scheduled function itself:

user=> (require '[manifold.deferred :as d])
nil
user=> (require '[manifold.time :as t])
nil
user=> @(t/in 1000 (fn [] @(d/timeout (d/deferred) 100 ::inner-timeout)))

This example is artificial, but we actually had similar situation in real code recently (just with more indirection involved).

manifold.time allows passing of custom clocks/executors to scheduling function (with-clock), so that's not really a major issue. But still should documentation be updated to clearly reflect that using manifold.time in application code as is can cause issues with other parts of Manifold? Alternatively, should two separate clocks be created by default - one for Manifold's own scheduling needs and one for all other manifold.time users (on the first glance I can't see problems with this approach, but maybe I'm missing something)?

joeltio commented 6 years ago

I had the same problem with s/try-take!:

(t/every 2000 #(do (println "hello")
                   (s/try-take! (s/stream) 1000)))

The above code will print hello once, then hang.

As in @gsnewmark's example, it seems that s/try-take! also uses d/timeout.

kachayev commented 5 years ago

@ztellman Initially this code used fixed pool executor resized to the num of cores. num-cores variable is still there, even tho' it's not used right now. What was the reason for switching to a single thread executor? I know that having more threads would not eliminate the problem described above, I'm just curious.

kachayev commented 5 years ago

@joeltio I've tried a few different versions of your code but it seems I cannot reproduce the issue with it. It works fine from the first glance.

@gsnewmark Regarding your example. I know where this issue comes from and I do understand how hard to debug something like that in a large codebase... but semantically, if you block a thread from whatever seems to be an "async" context (whatever that means technically), you always have problems. That's a curse of implementing async flows on top of JVM runtime, you should always keep in mind that there's a thread pool somewhere underneath the API and this thread pool might end up in the situation where all threads are blocked. core.async is a great example of how leaky this abstraction and how fragile the idea to write the program w/o understanding of the underlying threads mechanics. I think it makes sense to mention in the documentation that default scheduling thread pool should be used carefully.

With that being said I think there's no ideal solution here, what we can do:

Mention in the documentation that blocking operations inside chain/catch/in/every etc are very risky and you should think twice what thread pool backs each callback. Not specifically for manifold.time but for all callbacks.
Introduce a new API that doesn't require time/* callbacks to run on a scheduling thread. Just to make sure that schedule thread is doing only one thing: deliver values at some point in time. And the actual execution of callbacks (either provided or added as listeners using chain) happens somewhere else. I.e. time/every-on that requires an executor to be provided, an option to create a deferred that will be realized to a specific value after delay so you can use it with alt (think core.async/timeout channels).

@ztellman what do you think?

joeltio commented 5 years ago

@kachayev My bad, I just tried it and it didn't work too. I can't remember what I was trying to do, but I know that after running the above code, it will print hello only once, though the program will not hang.

The following code without the s/try-take! will print hello 6 times:

(t/every 2000 #(println "hello"))
(Thread/sleep 10000)

However, the following code will print hello once:

(t/every 2000 #(do (println "hello")
                   @(s/try-take! (s/stream) 1000)))
(Thread/sleep 10000)

To make it hang, you can swap out t/every with t/in:

@(t/in 2000 #(do (println "hello")
                 @(s/try-take! (s/stream) 1000)))

The code will print hello once, then hang.

kachayev commented 5 years ago

@joeltio

My bad, I just tried it and it didn't work too.

No problem!

                 @(s/try-take! (s/stream) 1000)))

That's exactly the same problem @gsnewmark described earlier (thread blocking that leads to the appropriate thread pool to be exhausted). I'm still thinking about what could be done here to improve the situation except for a better documentation.

gsnewmark commented 5 years ago

I'm perfectly fine with updating the documentation (as was indicated in the initial report), because right now the most glaring issue is that this behavior is not immediately obvious unless you check the internals of the Manifold. And when you know the reason it's quite easy to circumvent the issue.

clj-commons / manifold

Deadlock when using manifold.time scheduling in application code #148