Closed robeverest closed 6 years ago
Having many `forkOS` (or equivalently `runInBoundThread`) calls will have a large performance impact, as you mention. Additionally, there are a limited number of OS threads available, and if all the finaliser and other operations turn into OS threads, I think we are going to hit that limit for any reasonably sized or long-running program [1].
Also, I am concerned that the `runInBoundThread` idea is going to fall over if we allow the Haskell RTS to spawn multiple threads (i.e. `+RTS -N2`), because then being bound to an OS thread doesn't necessarily mean we are bound to the right OS thread w.r.t. CUDA's thread-local state (although I think a currently unbound CUDA context (r.e. `push` and `pop`) is allowed to migrate? They made a big deal a while back about playing nicer in threaded programs, but I can't remember exactly). [2]
Perhaps have a single thread created via `forkOn`/`forkOS` and pinned (will we need `+RTS -qa`?), and then all the other main/finaliser threads send it `IO ()` actions to execute? This should fit into the current execution model (which sends the CUDA runtime some asynchronous task graph), but getting result values out (e.g. `malloc`) would be more annoying. I think this should also scale well to multiple devices (each device/context gets its own OS worker thread) or to having `run` called from several Haskell threads (since everything gets funnelled down/serialised to the same worker). If we have one bound/OS thread (per context) doing all the CUDA API calls and all others are lightweight `forkIO` threads sending it actions to execute, do we still have a high context-switch overhead? I'm sure there are many other things I haven't considered, and it is a large change, so we should think about this more before proceeding.
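A minimal sketch of that funnelling idea (all names here are hypothetical, not the backend's actual API; `forkOS` needs the threaded RTS, so this falls back to `forkIO` elsewhere just so the sketch runs anywhere):

```haskell
module Main where

import Control.Concurrent      ( forkIO, forkOS, rtsSupportsBoundThreads )
import Control.Concurrent.Chan ( Chan, newChan, readChan, writeChan )
import Control.Concurrent.MVar ( newEmptyMVar, putMVar, takeMVar )
import Control.Monad           ( forever, join )

-- Spawn a single worker that drains a queue of IO actions. In the
-- real backend this worker would be the only thread making CUDA API
-- calls; here the actions are plain IO for illustration.
newWorker :: IO (Chan (IO ()))
newWorker = do
  queue <- newChan
  let fork = if rtsSupportsBoundThreads then forkOS else forkIO
  _ <- fork (forever (join (readChan queue)))
  return queue

-- Submit an action and wait for its result, so calls like 'malloc'
-- that return a value can still be expressed.
submit :: Chan (IO ()) -> IO a -> IO a
submit queue action = do
  result <- newEmptyMVar
  writeChan queue (action >>= putMVar result)
  takeMVar result

main :: IO ()
main = do
  worker <- newWorker
  n <- submit worker (return (length "CUDA"))
  print n
```

The `MVar` round-trip in `submit` is what answers the "getting result values out" question, at the cost of blocking the caller until the worker gets to the action.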
[1] Due to --- surprise --- finaliser-related bugs in the llvm backend, we hit this limit pretty quickly running nofib.
[2] I'll note that running an application with `+RTS -N2` currently fails under `nvprof`, but I'm not sure who to blame here. This might be a CUDA/nvprof problem, since the program seems to run fine otherwise, but maybe we really are screwing up when more than one processor / real OS thread is available and it is just that `nvprof` catches that error and barfs. I don't know...
I think we only need to have as many `forkOS` threads as there are calls to the `run*Async` variants from the main thread, or calls to `run*` from other threads. Because the main thread is always bound, we can adjust the non-asynchronous `run*` calls to use `runInBoundThread` so that they don't fork an OS thread unless absolutely necessary. This would cover 90% of our use cases. As for finalizers, if we have tables for all the resources that we use (arrays, streams and events), then the finalizers only need to insert these resources back into their corresponding table, without making any CUDA API calls. The only time a finalizer would need to actually destroy resources is when the tables themselves finalize, which only happens at program exit. The other case where resources need to be destroyed, arrays when we run out of memory, happens in the thread that called `malloc`, not in a finalizer. For these reasons, a frequently occurring finalizer should never need to use `runInBoundThread` or `forkOS`.
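A toy version of the table idea (the `Stream` type and the table here are stand-ins for the real resources, not actual accelerate code): the finalizer path only returns the resource to the table, and a real CUDA call would only ever happen on creation.

```haskell
module Main where

import Control.Concurrent.MVar

-- Hypothetical resource, standing in for a CUDA stream or event.
newtype Stream = Stream Int deriving Show

-- A table (pool) of idle resources, protected by an MVar.
type Table = MVar [Stream]

-- Take a resource from the table, or create a fresh one if the table
-- is empty. Creation is the only place a real CUDA API call (and
-- hence a bound thread) would be needed.
acquire :: Table -> IO Stream
acquire table = modifyMVar table $ \idle ->
  case idle of
    (s:rest) -> return (rest, s)
    []       -> return ([], Stream 0)   -- stand-in for e.g. Stream.create

-- What a finalizer would run: no CUDA call, just put the resource
-- back into the table. Safe from any (unbound) thread.
release :: Table -> Stream -> IO ()
release table s = modifyMVar_ table (return . (s:))

main :: IO ()
main = do
  table <- newMVar []
  s     <- acquire table
  release table s
  idle  <- readMVar table
  print (length idle)   -- the stream was recycled, not destroyed
```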
I'm not sure having a single bound thread that is sent actions to execute would work out so well. Firstly, as you say, things like `malloc` which have a return value become problematic. I think we actually make a lot of calls like that, including creating streams and events, querying events, waiting on events, and simple things like getting the current memory usage. Secondly, all CUDA API calls can raise exceptions and we'd have to handle that in some way. It would be painful to debug if exceptions were reported long after they occurred. Lastly, and I think most importantly, I'm not sure having multiple Haskell threads sending work off to, and waiting on, a single bound thread is actually any better performance-wise than having many bound threads. By waiting on the bound thread we'd essentially be forcing these expensive context switches to happen. If we had multiple bound threads working independently, I imagine the scheduler would distribute them better across the available cores and reduce the number of context switches.
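For the record, the exception objection at least has a workaround: catch on the worker and re-raise at the call site, so failures still surface where the call was made. A sketch (hypothetical names, plain `forkIO` worker for illustration):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Control.Concurrent      ( forkIO )
import Control.Concurrent.Chan ( Chan, newChan, readChan, writeChan )
import Control.Concurrent.MVar ( MVar, newEmptyMVar, putMVar, takeMVar )
import Control.Exception       ( SomeException, throwIO, try )
import Control.Monad           ( forever, join )

-- Run an action on the worker, catch anything it throws there, and
-- re-raise it in the calling thread, so errors are not delayed or
-- lost on the worker.
submitSync :: forall a. Chan (IO ()) -> IO a -> IO a
submitSync queue action = do
  box <- newEmptyMVar :: IO (MVar (Either SomeException a))
  writeChan queue (try action >>= putMVar box)
  either throwIO return =<< takeMVar box

main :: IO ()
main = do
  queue <- newChan
  _ <- forkIO (forever (join (readChan queue)))
  r <- try (submitSync queue (throwIO (userError "CUDA error")))
  print (r :: Either SomeException ())
```

This doesn't address the performance concern, of course; the caller still blocks on the worker for every call.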
I have a very simple example here that highlights the issue at hand:
```haskell
module Main where

import Foreign.CUDA.Driver.Context hiding ( device )
import Foreign.CUDA.Driver.Device  ( device, initialise )
import Control.Concurrent          ( forkOn, forkIO, forkOS, yield )
import Control.Concurrent.MVar

main = do
  var <- newEmptyMVar
  forkIO $ do
    initialise []
    dev <- device 0
    ctx <- create dev []
    yield
    pop
    putMVar var ()
  readMVar var
```
If you compile this with `-threaded` and `-prof` (why this is necessary, I don't know) and then run it through `cuda-gdb`, it will fail about 25% of the time. There are a few things I have observed:
- Changing `forkIO` to `forkOS` fixes the problem, as expected. Using `forkOn` does not: while `forkOn` threads have affinity, they are not bound.
- When `forkOS` is used, using `+RTS -N4` is also not a problem, even when you make the example more complicated and have it fork many threads.
- It seems that as long as every thread created with `forkOS` has the right context `push`ed onto it, things work fine. It does indeed look like, also from what I have read, that contexts can migrate easily and be attached to multiple different threads.
- `nvprof` doesn't seem to pick up on any problems, but I suspect that's more of a happy coincidence than anything else.

Are there any particular examples you can think of where enough calls to `run*` are made concurrently that they are likely to hit the limit of available OS threads? Nofib is the obvious one, but that's not a very realistic use case and we can change the way it works if necessary.
I should also add that if we find that we still run out of OS threads, even when they're not used for finalizers, we could possibly overcome that by managing our own thread pool. It still isn't ideal, but it would at least remove that particular problem.
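A fixed-size pool along those lines might look like this (sketch only, invented names): N workers share one queue, so the OS-thread count stays bounded no matter how many callers are in flight. As above, the `forkIO` fallback is only so the sketch runs on a non-threaded RTS.

```haskell
module Main where

import Control.Concurrent      ( forkIO, forkOS, rtsSupportsBoundThreads )
import Control.Concurrent.Chan ( Chan, newChan, readChan, writeChan )
import Control.Concurrent.MVar ( newEmptyMVar, putMVar, takeMVar )
import Control.Monad           ( forever, join, replicateM_ )

-- A fixed number of worker threads sharing one queue, so the number
-- of OS threads stays bounded regardless of how many run* calls or
-- finalizers are submitting work.
newPool :: Int -> IO (Chan (IO ()))
newPool n = do
  queue <- newChan
  let fork = if rtsSupportsBoundThreads then forkOS else forkIO
  replicateM_ n (fork (forever (join (readChan queue))))
  return queue

main :: IO ()
main = do
  pool <- newPool 4
  done <- newEmptyMVar
  writeChan pool (putMVar done ())
  takeMVar done
  putStrLn "done"
```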
You know, I think these are all fixed now; I have not seen these problems in a long time.
This was talked about in #227 and in AccelerateHS/accelerate-cuda#11, but I figured it was best to create a separate ticket for it as it didn't really have one dedicated to it. I suspect that this is also strongly related to #260.
The good news is I think I have found the problem. The bad news is that the only solution I have come up with has a performance penalty and requires rethinking a few things. Basically, we never ensure that all our calls to the CUDA API are from bound threads. Because CUDA depends on thread-local state (the "context"), we should be using `forkOS` instead of `forkIO`. This is unfortunate, as a context switch between bound threads is significantly more expensive than between `forkIO` threads.

The other problem this brings up is CUDA API calls in finalizers. Finalizers are also not run in bound threads. The simplest solution to this is to use `runInBoundThread` in the finalizers themselves, but given the number of finalizers firing now, I think it would be a significant performance hit to do that. I believe that by caching resources (like events) I can get it so that the only time finalizers will need to be bound is at program exit, where the performance cost is not such a problem. I think that would be a better solution.

Unless anyone has a better way of solving this (looking mostly at you @tmcdonell), I'm going to go ahead and make the necessary changes.