Closed robeverest closed 6 years ago
Having many `forkOS` (or equivalently `runInBoundThread`) calls will have a large performance impact, as you mention. Additionally, there are a limited number of OS threads available, and if all the finaliser and other operations turn into OS threads, I think we are going to hit that limit for any reasonably sized or long-running program [1].
Also, I am concerned that the `runInBoundThread` idea is going to fall over if we allow the Haskell RTS to spawn multiple threads (i.e. `+RTS -N2`), because then being bound to an OS thread doesn't necessarily mean we are bound to the right OS thread w.r.t. CUDA's thread-local state (although I think a currently unbound CUDA context (r.e. `push` and `pop`) is allowed to migrate? They made a big deal a while back about playing nicer in threaded programs, but I can't remember exactly). [2]
Perhaps have a single thread created via `forkOn`/`forkOS` and pinned (will we need `+RTS -qa`?), and then all the other main/finaliser threads send it `IO ()` actions to execute? This should fit into the current execution model (which sends the CUDA runtime some asynchronous task graph), but getting result values out (e.g. `malloc`) would be more annoying. I think this should also scale well to multiple devices (each device/context gets its own OS worker thread) or to having `run` called from several Haskell threads (since everything gets funnelled down/serialised to the same worker). If we have one bound/OS thread (per context) doing all the CUDA API calls and all others are lightweight `forkIO` threads sending it actions to execute, do we still have a high context-switch overhead? I'm sure there are many other things I haven't considered, and it is a large change, so we should think about this more before proceeding.
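A minimal sketch of that funnelling idea (all names here are hypothetical, not the backend's actual API; `forkOS` needs the threaded RTS, so this falls back to `forkIO` elsewhere just so the sketch runs anywhere):

```haskell
module Main where

import Control.Concurrent      ( forkIO, forkOS, rtsSupportsBoundThreads )
import Control.Concurrent.Chan ( Chan, newChan, readChan, writeChan )
import Control.Concurrent.MVar ( newEmptyMVar, putMVar, takeMVar )
import Control.Monad           ( forever, join )

-- Spawn a single worker that drains a queue of IO actions. In the
-- real backend this worker would be the only thread making CUDA API
-- calls; here the actions are plain IO for illustration.
newWorker :: IO (Chan (IO ()))
newWorker = do
  queue <- newChan
  let fork = if rtsSupportsBoundThreads then forkOS else forkIO
  _ <- fork (forever (join (readChan queue)))
  return queue

-- Submit an action and wait for its result, so calls like 'malloc'
-- that return a value can still be expressed.
submit :: Chan (IO ()) -> IO a -> IO a
submit queue action = do
  result <- newEmptyMVar
  writeChan queue (action >>= putMVar result)
  takeMVar result

main :: IO ()
main = do
  worker <- newWorker
  n <- submit worker (return (length "CUDA"))
  print n
```

The `MVar` round-trip in `submit` is what answers the "getting result values out" question, at the cost of blocking the caller until the worker gets to the action.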
[1] Due to --- surprise --- finaliser-related bugs in the llvm backend, we hit this limit pretty quickly running nofib.
[2] I'll note that running an application with `+RTS -N2` currently fails under `nvprof`, but I'm not sure who to blame here. This might be a CUDA/nvprof problem, since the program seems to run fine otherwise, but maybe we really are screwing up when more than one processor / real OS thread is available and it is just that `nvprof` catches that error and barfs. I don't know...
I think we only need to have as many `forkOS` threads as there are calls to the `run*Async` variants from the main thread, or calls to `run*` from other threads. Because the main thread is always bound, we can adjust the non-asynchronous `run*` calls to use `runInBoundThread` so that they don't fork an OS thread unless absolutely necessary. This would cover 90% of our use cases. As for finalizers, if we have tables for all the resources that we use (arrays, streams and events), then the finalizers only need to insert these resources back into their corresponding table, without making any CUDA API calls. The only time a finalizer would need to actually destroy resources is when the tables themselves finalize, which only happens at program exit. The other case where resources need to be destroyed, arrays when we run out of memory, happens in the thread that called `malloc`, not in a finalizer. For these reasons, a frequently occurring finalizer should never need to use `runInBoundThread` or `forkOS`.
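A toy version of the table idea (the `Stream` type and the table here are stand-ins for the real resources, not actual accelerate code): the finalizer path only returns the resource to the table, and a real CUDA call would only ever happen on creation.

```haskell
module Main where

import Control.Concurrent.MVar

-- Hypothetical resource, standing in for a CUDA stream or event.
newtype Stream = Stream Int deriving Show

-- A table (pool) of idle resources, protected by an MVar.
type Table = MVar [Stream]

-- Take a resource from the table, or create a fresh one if the table
-- is empty. Creation is the only place a real CUDA API call (and
-- hence a bound thread) would be needed.
acquire :: Table -> IO Stream
acquire table = modifyMVar table $ \idle ->
  case idle of
    (s:rest) -> return (rest, s)
    []       -> return ([], Stream 0)   -- stand-in for e.g. Stream.create

-- What a finalizer would run: no CUDA call, just put the resource
-- back into the table. Safe from any (unbound) thread.
release :: Table -> Stream -> IO ()
release table s = modifyMVar_ table (return . (s:))

main :: IO ()
main = do
  table <- newMVar []
  s     <- acquire table
  release table s
  idle  <- readMVar table
  print (length idle)   -- the stream was recycled, not destroyed
```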
I'm not sure having a single bound thread that is sent actions to execute would work out so well. Firstly, as you say, things like `malloc` which have a return value become problematic. I think we actually make a lot of calls like that, including creating streams and events, querying events, waiting on events, and simple things like getting the current memory usage. Secondly, all CUDA API calls can raise exceptions and we'd have to handle that in some way. It would be painful to debug if exceptions were reported long after they occurred. Lastly, and I think most importantly, I'm not sure having multiple Haskell threads sending work off to, and waiting on, a single bound thread is actually any better performance-wise than having many bound threads. By waiting on the bound thread we'd essentially be forcing these expensive context switches to happen. If we had multiple bound threads working independently, I imagine the scheduler would distribute them better across the available cores and reduce the number of context switches.
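For the record, the exception objection at least has a workaround: catch on the worker and re-raise at the call site, so failures still surface where the call was made. A sketch (hypothetical names, plain `forkIO` worker for illustration):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Control.Concurrent      ( forkIO )
import Control.Concurrent.Chan ( Chan, newChan, readChan, writeChan )
import Control.Concurrent.MVar ( MVar, newEmptyMVar, putMVar, takeMVar )
import Control.Exception       ( SomeException, throwIO, try )
import Control.Monad           ( forever, join )

-- Run an action on the worker, catch anything it throws there, and
-- re-raise it in the calling thread, so errors are not delayed or
-- lost on the worker.
submitSync :: forall a. Chan (IO ()) -> IO a -> IO a
submitSync queue action = do
  box <- newEmptyMVar :: IO (MVar (Either SomeException a))
  writeChan queue (try action >>= putMVar box)
  either throwIO return =<< takeMVar box

main :: IO ()
main = do
  queue <- newChan
  _ <- forkIO (forever (join (readChan queue)))
  r <- try (submitSync queue (throwIO (userError "CUDA error")))
  print (r :: Either SomeException ())
```

This doesn't address the performance concern, of course; the caller still blocks on the worker for every call.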
I have a very simple example here that highlights the issue at hand:
```haskell
module Main where

import Foreign.CUDA.Driver.Context hiding ( device )
import Foreign.CUDA.Driver.Device  ( device, initialise )
import Control.Concurrent          ( forkOn, forkIO, forkOS, yield )
import Control.Concurrent.MVar

main = do
  var <- newEmptyMVar
  forkIO $ do
    initialise []
    dev <- device 0
    ctx <- create dev []
    yield
    pop
    putMVar var ()
  readMVar var
```
If you compile this with `-threaded` and `-prof` (why this is necessary, I don't know) and then run it through `cuda-gdb`, it will fail about 25% of the time. There are a few things I have observed:
- Changing `forkIO` to `forkOS` fixes the problem, as expected. Using `forkOn` does not: while `forkOn` threads have affinity, they are not bound.
- When `forkOS` is used, using `+RTS -N4` is also not a problem, even when you make the example more complicated and have it fork many threads.
- It seems that as long as every thread created with `forkOS` has the right context `push`ed onto it, things work fine. It does indeed look like, also from what I have read, that contexts can migrate easily and be attached to multiple different threads.
- `nvprof` doesn't seem to pick up on any problems, but I suspect that's more of a happy coincidence than anything else.

Are there any particular examples you can think of where enough calls to `run*` are made concurrently that they are likely to hit the limit of available OS threads? Nofib is the obvious one, but that's not a very realistic use case and we can change the way it works if necessary.
I should also add that if we find that we still run out of OS threads, even when they're not used for finalizers, we could possibly overcome that by managing our own thread pool. It still isn't ideal, but it would at least remove that particular problem.
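A fixed-size pool along those lines might look like this (sketch only, invented names): N workers share one queue, so the OS-thread count stays bounded no matter how many callers are in flight. As above, the `forkIO` fallback is only so the sketch runs on a non-threaded RTS.

```haskell
module Main where

import Control.Concurrent      ( forkIO, forkOS, rtsSupportsBoundThreads )
import Control.Concurrent.Chan ( Chan, newChan, readChan, writeChan )
import Control.Concurrent.MVar ( newEmptyMVar, putMVar, takeMVar )
import Control.Monad           ( forever, join, replicateM_ )

-- A fixed number of worker threads sharing one queue, so the number
-- of OS threads stays bounded regardless of how many run* calls or
-- finalizers are submitting work.
newPool :: Int -> IO (Chan (IO ()))
newPool n = do
  queue <- newChan
  let fork = if rtsSupportsBoundThreads then forkOS else forkIO
  replicateM_ n (fork (forever (join (readChan queue))))
  return queue

main :: IO ()
main = do
  pool <- newPool 4
  done <- newEmptyMVar
  writeChan pool (putMVar done ())
  takeMVar done
  putStrLn "done"
```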
You know, I think these are all fixed now; I have not seen these problems in a long time.
This was talked about in #227 and in AccelerateHS/accelerate-cuda#11, but I figured it was best to create a separate ticket for it as it didn't really have one dedicated to it. I suspect that this is also strongly related to #260.
The good news is I think I have found the problem. The bad news is that the only solution I have come up with has a performance penalty and requires rethinking a few things. Basically, we never ensure that all our calls to the CUDA API are from bound threads. Because CUDA depends on thread-local state (the "context"), we should be using `forkOS` instead of `forkIO`. This is unfortunate, as a context switch between bound threads is significantly more expensive than between `forkIO` threads.

The other problem this brings up is CUDA API calls in finalizers. Finalizers are also not run in bound threads. The simplest solution to this is to use `runInBoundThread` in the finalizers themselves, but given the number of finalizers firing now, I think it would be a significant performance hit to do that. I believe that by caching resources (like events) I can get it so that the only time finalizers will need to be bound is at program exit, where the performance cost is not such a problem. I think that would be a better solution.

Unless anyone has a better way of solving this (looking mostly at you @tmcdonell), I'm going to go ahead and make the necessary changes.