JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Rethinking the programming model #143

Open MikeInnes opened 5 years ago

MikeInnes commented 5 years ago

Duplicating https://github.com/FluxML/Flux.jl/issues/706 here so that the right people can see it. I think the GPU maintainers generally agree that this is a good idea (please say if not) but we haven't written it down anywhere yet. Ideally we can work out some forward path for putting some effort into this.

maleadt commented 5 years ago

I'm working on some of the necessary CUDAdrv improvements over at https://github.com/JuliaGPU/CUDAdrv.jl/pull/133.

vchuravy commented 5 years ago

Part of the challenge is that only on very modern Linux systems is any malloc valid. On pretty much anything else you need to use cudaMalloc :/

maleadt commented 5 years ago

Part of the challenge is that only on very modern Linux system any malloc is valid.

Is there even a version of Linux & CUDA where this works? Sure, HMM is merged in 4.14, but it doesn't work on CUDA 10 + Linux 4.19.

Furthermore, it's not like unified memory is a magic bullet. Workloads that flip between CPU and GPU will still be about as slow as the current allowscalar(true), so I think one would prefer a hard and clear failure when that happens.
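The hard, clear failure described above is what `allowscalar(false)` already provides; a minimal sketch, using the CuArrays-era API this thread predates (the same switch now lives at `CUDA.allowscalar` in CUDA.jl):

```julia
using CuArrays

# Disable the slow scalar-iteration fallback: any per-element CPU access
# to a CuArray now throws instead of silently crawling.
CuArrays.allowscalar(false)

x = cu(rand(Float32, 1024))
s = sum(x)    # fine: dispatches to a GPU reduction kernel
# x[1]        # would throw an error: scalar getindex is disallowed
```

This is the failure mode maleadt is arguing for: code that would ping-pong between host and device errors out loudly rather than degrading into implicit memory traffic.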

MikeInnes commented 5 years ago

Widely-available HMM definitely seems like the major blocker. I think it's worth exploring whether some workarounds are possible. For example, we could swap out Julia's default malloc (and even swap out all existing pointers when CuArrays is loaded). This seems technically feasible, though I don't know if there are downsides to using cudaMalloc by default for all allocations.

If the major downside to this approach is that we have a little extra work to turn slow code into failures/warnings, that seems like an OK position to be in. If cuda is a compiler pass there's plenty of good tooling and diagnostics we can build around that pretty easily.

maleadt commented 5 years ago

a little extra work to turn slow code into failures/warnings

Except that those cases would become very hard to spot. As soon as some shared pointer leaks (which wouldn't be limited to CuArray <-> Ptr conversions, since anything CPU-allocated can leak into GPU code and vice versa) there's the risk of slowing down computation, causing memory traffic, etc.

Isn't the higher abstraction level much better suited for capturing inputs and uploading them to the GPU? I haven't been following Flux.jl, but I'd much rather improve that than bet on unified memory (performance cost: unknown) and hope we don't make things even harder to reason about.

MikeInnes commented 5 years ago

I think that's where we need some empirical testing, to see how likely this really is to trip people up. My feeling is that while those cases are possible, they are going to be much less common than just running a few simple matmuls in a clearly scoped block, which is going to work fine and have far fewer hazards than the current model. The cost of running the experiment seems low for the potential gains -- and we can decide whether to bet the farm on it later.

FWIW what I'm proposing is also significantly different from the CUDA C unified programming model, where CPU and GPU kernels can be pretty freely mixed, and closer to what we have now. Kernels don't have to be allowed outside a cuda block and scalar indexing can be disabled within it; it can be thought of as simply automating the conversion to CuArray (indeed that might be one way to prototype it).
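The "automating the conversion to CuArray" prototype suggested above can be sketched without any compiler support. This is purely illustrative: the `on_gpu` helper is a hypothetical name, not an existing CuArrays or CUDA.jl API, and a real implementation would need to handle nested structures and avoid round-tripping intermediates:

```julia
using CuArrays

# Sketch of the proposed scoped-block model: upload the inputs, run the
# body with scalar indexing disabled, and fetch the result back.
function on_gpu(f, args...)
    gpu_args = map(cu, args)        # automate the Array -> CuArray conversion
    CuArrays.allowscalar(false)     # inside the block, scalar fallback is an error
    result = f(gpu_args...)         # array ops dispatch to GPU kernels
    return Array(result)            # download the result to the host
end

W, x = rand(Float32, 10, 10), rand(Float32, 10)
y = on_gpu((W, x) -> tanh.(W * x), W, x)   # matmul + broadcast run on the GPU
```

A compiler-pass version would do the same thing implicitly for a `cuda` block, which is why prototyping it as a higher-order function first is cheap.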

Improving Flux is obviously preferable, but I basically think we've hit a wall there. You put conversions in a bunch of places and if it's slightly wrong you go out of memory or get an obscure error. The TensorFlow-style approach takes control of that for you at a very high cost to usability (that's why we're here, after all). Unified memory is the only way I can see to get the best of all worlds, though of course I'm very open to other suggestions.

MikeInnes commented 5 years ago

My issue title was misleading and unclear; unified memory is kind of beside the point here, it's just one implementation of a better CUDA programming model (and possibly not the best one).

We discussed this a bit today and came to the conclusion that prototyping this as a simple compiler pass is the right way to try it out. There are various other things – e.g. better array abstractions in Base – that we may need for the full story, but that's a start. I may get time to prototype something soon.

Anyone interested in hacking on this is welcome to reach out and I can help with that too.