MikeInnes opened 5 years ago
I'm working on some of the necessary CUDAdrv improvements over at https://github.com/JuliaGPU/CUDAdrv.jl/pull/133.
Part of the challenge is that only on very modern Linux systems is any malloc valid. On pretty much anything else you need to use cudaMalloc :/
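For the record, a rough sketch of what going through the runtime for managed allocations could look like from Julia. The helper name managed_alloc is made up, and this assumes the CUDA runtime is loadable as "libcudart":

```julia
# Hypothetical sketch: allocate managed (unified) memory via the CUDA runtime,
# so the same pointer is valid on both CPU and GPU without an explicit copy.
# 0x01 is cudaMemAttachGlobal.
function managed_alloc(nbytes::Integer)
    ptr = Ref{Ptr{Cvoid}}(C_NULL)
    err = ccall((:cudaMallocManaged, "libcudart"), Cint,
                (Ptr{Ptr{Cvoid}}, Csize_t, Cuint),
                ptr, nbytes, 0x01)
    err == 0 || error("cudaMallocManaged failed with error code $err")
    return ptr[]
end

# Wrap the managed buffer as an ordinary Julia Array; GPU kernels can use the
# same pointer, which is roughly what "use cudaMalloc instead of malloc" means.
n = 1024
buf = convert(Ptr{Float32}, managed_alloc(n * sizeof(Float32)))
A = unsafe_wrap(Array, buf, n)
```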
Part of the challenge is that only on very modern Linux systems is any malloc valid.
Is there even a version of Linux & CUDA where this works? Sure, HMM is merged in 4.14, but it doesn't work on CUDA 10 + Linux 4.19.
Furthermore, it's not like unified memory is a magic bullet. Workloads that flip between CPU and GPU will still be about as slow as the current allowscalar(true), so I think one would prefer a hard and clear failure when that happens.
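To make that concrete, here's a small illustration of the two behaviours being contrasted, using CuArrays' existing allowscalar switch:

```julia
using CuArrays

x = CuArray(rand(Float32, 1000))

# With scalar indexing allowed, code that flips between CPU and GPU still runs,
# but every x[i] is a separate device round-trip -- silently slow.
CuArrays.allowscalar(true)
s = sum(x[i] for i in 1:10)

# With scalar indexing disallowed, the same pattern is a hard, clear failure
# instead of a silent slowdown.
CuArrays.allowscalar(false)
x[1]  # throws an error
```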
Widely-available HMM definitely seems like the major blocker. I think it's worth exploring whether some workarounds are possible. For example, we could swap out Julia's default malloc (and even swap out all existing pointers when CuArrays is loaded). This seems technically feasible, though I don't know if there are downsides to using cudaMalloc by default for all allocations.
If the major downside to this approach is that we have a little extra work to turn slow code into failures/warnings, that seems like an OK position to be in. If cuda is a compiler pass, there's plenty of good tooling and diagnostics we can build around that pretty easily.
a little extra work to turn slow code into failures/warnings
Except that those cases would become very hard to spot. As soon as some shared pointer leaks (which wouldn't be limited to CuArray <-> Ptr conversions, since anything CPU-allocated can leak into GPU code and vice versa) there's the risk of slowing down computation, causing memory traffic, etc.
Isn't the higher abstraction level much more suited for capturing inputs and uploading them to the GPU? I haven't been following Flux.jl, but I think I greatly prefer improving it as opposed to betting on unified memory (performance cost: unknown) and hoping we don't make things even harder to reason about.
I think that's where we need some empirical testing, to see how likely this really is to trip people up. My feeling is that while those cases are possible, they are going to be much less common than just running a few simple matmuls in a clearly scoped block, which is going to work fine and have far fewer hazards than the current model. The cost of running the experiment seems low for the potential gains -- and we can decide whether to bet the farm on it later.
FWIW what I'm proposing is also significantly different from the CUDA C unified programming model, where CPU and GPU kernels can be pretty freely mixed, and closer to what we have now. Kernels don't have to be allowed outside a cuda block, and scalar indexing can be disabled within it; it can be thought of as simply automating the conversion to CuArray (indeed, that might be one way to prototype it).
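As a very rough, non-compiler sketch of that idea -- a plain higher-order function that automates the conversion. The name cuda and the use of Adapt.jl here are just for illustration, not a settled design:

```julia
using CuArrays, Adapt

# Sketch: move array arguments to the GPU, run the body with scalar indexing
# disabled (so leaks fail loudly), then bring the result back to the CPU.
function cuda(f, args...)
    gpu_args = map(a -> adapt(CuArray, a), args)
    CuArrays.allowscalar(false)
    try
        return adapt(Array, f(gpu_args...))
    finally
        CuArrays.allowscalar(true)
    end
end

# Usage: the matmul runs on the GPU; the caller only ever sees plain Arrays.
W, x = randn(Float32, 128, 128), randn(Float32, 128)
y = cuda((W, x) -> W * x, W, x)
```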
Improving Flux is obviously preferable, but I basically think we've hit a wall there. You put conversions in a bunch of places and if it's slightly wrong you go out of memory or get an obscure error. The TensorFlow-style approach takes control of that for you at a very high cost to usability (that's why we're here, after all). Unified memory is the only way I can see to get the best of all worlds, though of course I'm very open to other suggestions.
My issue title was misleading and unclear; unified memory is kind of beside the point here, it's just one implementation of a better CUDA programming model (and possibly not the best one).
We discussed this a bit today and came to the conclusion that prototyping this as a simple compiler pass is the right way to try it out. There are various other things – e.g. better array abstractions in Base – that we may need for the full story, but that's a start. I may get time to prototype something soon.
Anyone interested in hacking on this is welcome to reach out and I can help with that too.
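As one possible starting point, here's a very rough sketch of the kind of interception a prototype could do with Cassette.jl contextual dispatch; the context name and the single intercepted method are purely illustrative:

```julia
using Cassette, CuArrays

# Sketch: run unmodified code under a context that reroutes selected operations
# to the GPU. Only Float32 matvec is intercepted here, just to show the shape
# of the approach; a real pass would cover allocation and many more methods.
Cassette.@context CUDACtx

Cassette.overdub(::CUDACtx, ::typeof(*), A::Matrix{Float32}, x::Vector{Float32}) =
    Array(CuArray(A) * CuArray(x))

# Usage: f is written against plain Arrays, but the matmul inside it runs on
# the GPU when executed under the context.
f(W, x) = W * x
W, x = randn(Float32, 128, 128), randn(Float32, 128)
y = Cassette.overdub(CUDACtx(), f, W, x)
```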
Duplicating https://github.com/FluxML/Flux.jl/issues/706 here so that the right people can see it. I think the GPU maintainers generally agree that this is a good idea (please say if not) but we haven't written it down anywhere yet. Ideally we can work out some forward path for putting some effort into this.