JuliaGPU / meta

Place for discussing general Julia GPGPU related topics

What support we need in the runtime/code generator to support GPUs #2

Keno opened this issue 8 years ago (status: Open)

Keno commented 8 years ago

@vtjnash @JeffBezanson and I were discussing the path forward for GPU support in julia today. Now, my knowledge of GPU programming is limited to having read the CUDA guide over Thanksgiving, so please correct any misunderstanding I may have. I will also admit that I haven't yet looked at the code you guys have already written (partly so I could approach this without bias as to what already exists), so please let me know if you've already thought about all this and concluded it's a bad idea ;).

Fundamentally what I would like to discuss is:

1) What's the execution and memory model that we support for GPU programming in the language?

2) What is required in base julia in terms of code generation and runtime support, i.e. what are the primitives we need to support?

I will now summarize what the three of us discussed today (to the best of my recollection). This is by no means a complete design or a commitment to implement it, but hopefully it can serve as a starting point for discussion. Fundamentally, we were thinking the execution model may look something like this:

foo(x) = sin(cos(x))

HostArray = rand(10, 10)
A = convert(GPUArray, HostArray) # Implicitly schedules a memcopy
map(foo, A)::GPUArray{Float64}   # Schedules the map on the same stream as the memcopy

# Generate three random GPU arrays, each on their own stream
B = rand(GPUArray, 10, 10)
C = rand(GPUArray, 10, 10)
D = rand(GPUArray, 10, 10)

E = B*C # Gets scheduled on one of B's or C's streams, with explicit synchronization
F = D*E # Same

convert(Array, F) # Waits for the computation to complete and copies the memory back to the host
# If you just want the computation to finish, you may call wait(F)

In this scheme, you can think of GPUArrays as futures for values on the GPU, with the task graph being constructed implicitly. I should also point out that the blocking operations mentioned above are intended to be task-blocking rather than thread-blocking. We briefly considered trying to integrate all of this with the regular task scheduler, e.g. by having a higher-order task that is essentially an entry point plus a multiplicity, which would then get lowered to either a bunch of plain tasks on the CPU (potentially loops over subranges) or a kernel scheduled on the GPU, but it's not entirely clear that such an approach would work. Still, it might be worth discussing, so I'm mentioning it here.
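To make these semantics concrete, here is a minimal CPU-only mock of the futures idea, using plain Julia Tasks in place of CUDA streams and events. FakeGPUArray and payload are invented names for illustration, not the proposed implementation.

# CPU-only mock: every array value is a future, and waiting blocks the
# current task, not the whole thread.
struct FakeGPUArray{T,N}
    task::Task   # stands in for the stream/event the real design would track
end

# "Upload": schedule the copy asynchronously and hand back the future at once.
FakeGPUArray(host::Array{T,N}) where {T,N} =
    FakeGPUArray{T,N}(Threads.@spawn copy(host))

payload(A::FakeGPUArray{T,N}) where {T,N} = fetch(A.task)::Array{T,N}

# A consumer waits on both producers before computing, mimicking the
# explicit cross-stream synchronization described above.
Base.:*(A::FakeGPUArray{T,2}, B::FakeGPUArray{T,2}) where {T} =
    FakeGPUArray{T,2}(Threads.@spawn payload(A) * payload(B))

Base.wait(A::FakeGPUArray) = (wait(A.task); A)
Base.convert(::Type{Array}, A::FakeGPUArray) = payload(A)

B = FakeGPUArray(rand(10, 10))
C = FakeGPUArray(rand(10, 10))
E = B * C           # returns immediately; the multiply is merely scheduled
convert(Array, E)   # task-blocks until the "kernel" has finished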

Now, the next question is: assuming that's what we want the API to look like, what are the primitives we need in julia to make this usable? I suspect the answer is that you want map! with optional extra arguments as a way to start a kernel, e.g.

outer(u, v) = map!(GPUArray(length(u), length(v)), GPUArray(u), GPUArray(v)) do dims, out, gpuu, gpuv
    # These aren't actual variables, but will get lowered to the appropriate GPU registers
    x, y = dims
    out[x, y] = gpuu[x] * gpuv[y]
end

I suspect you can do everything you want with that, though you may need a dummy GPU array that doesn't actually have data, but only specifies dimensions. In order to support this in the code generator, we would add the ability to register extra execution engines at runtime and then keep a copy of the fptr/LLVM functions for each execution engine. I'm not entirely sure what the interface for this will be yet, but it'll probably just be an instance of the JuliaOJIT execution engine (see my recent PR), with a different linking layer that does whatever crazy incantations the GPU driver expects in order to load code there.
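As a rough sketch of that bookkeeping (all names here are invented for illustration, not an actual interface): codegen would hold one execution engine per target and keep a compiled specialization per (function, engine) pair, querying the engine whenever it needs a target-specific decision.

# Hypothetical sketch of per-engine codegen state; none of these types exist.
abstract type AbstractExecEngine end

struct HostEngine <: AbstractExecEngine end
struct PTXEngine  <: AbstractExecEngine end  # would wrap the GPU driver's loader

# irgen would query the active engine instead of hardcoding a target:
target_triple(::HostEngine) = "x86_64-unknown-linux-gnu"
target_triple(::PTXEngine)  = "nvptx64-nvidia-cuda"

# One function pointer per (method instance, engine), instead of a single
# global fptr per method:
const CODE_CACHE = IdDict{Tuple{Any,AbstractExecEngine},Ptr{Cvoid}}()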

The thing that will have to be in julia base is any modification to irgen required to support the GPU, which would be handled by querying the execution engine for its target architecture and making decisions based on that. Fundamentally, what we need to avoid in this design is adding a dependency on the GPU vendor's SDK library to base julia.

cc @SimonDanisch @vchuravy @maleadt @timholy @ViralBShah @tkelman (+ others who I forgot).

vchuravy commented 8 years ago

That sounds very exciting. I was wondering whether we might be able to do automatic codegen for functions like this:

function f(X, M)
    for i in eachindex(X)
        X[i] *= M[i]
    end
end

In the OpenACC or OpenMP programming model, that loop could be hoisted out and executed on the device, and here that decision could be made through multiple dispatch. The true power of Julia would shine if I could write a general function and, based on either where it is called from (GPUContext) or what its arguments are (GPUArray), have it optimized and executed in the right place.
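For instance (a sketch, with MockGPUArray invented as a stand-in for a real device array type), dispatch alone could select the execution path:

# CPU mock of the dispatch idea; a real version would launch a device kernel
# in the specialized method instead of delegating back to the host loop.
struct MockGPUArray{T,N}
    data::Array{T,N}
end

# The generic loop from above stays the host fallback:
function f(X::AbstractArray, M::AbstractArray)
    for i in eachindex(X, M)
        X[i] *= M[i]
    end
    return X
end

# Specialization selected purely by multiple dispatch:
f(X::MockGPUArray, M::MockGPUArray) = (f(X.data, M.data); X)

f(MockGPUArray(rand(4)), MockGPUArray(rand(4)))  # "runs on the device"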

maleadt commented 8 years ago

The proposed execution model looks interesting and user-friendly, but I would also expose a lower-level interface as many computations are not easily expressed in terms of, e.g., map, scan and reduce. These primitives are also notoriously hard to implement efficiently (e.g. [1]) so I would leave it up to the user to optimize for performance. Fundamentally though, I'm looking forward to having such an interface :smile:

Concerning the codegen, I'm not sure the current ad-hoc approach is really viable. Sprinkling conditional code around seems pretty fragile, and we often end up fighting the language (for example, disallowing boxing or hoping LLVM optimizes the boxes away; the same goes for generic_apply, etc.). As an alternative to adding if target==PTX all over the place, I was considering trying to extract re-usable pieces of functionality from the current codegen and re-implementing a CUDA back-end from scratch using those primitives. In a perfect world, this back-end could be implemented in Julia and shipped as part of e.g. CUDA.jl, registering itself with the main codegen during initialization. This is obviously a pretty vast effort, and I'm not sure it's really viable, but that's what this brainstorm issue is for. It also doesn't solve the GC issue, which seems inherently incompatible with the GPU execution model.

Keno commented 8 years ago

What lower-level primitives would you like to have? I agree that the full generality of having external codegen would be nice, but that seems like a somewhat separate effort. The need for essentially perfect type inference is a bigger problem. It seems like we should be able to do some analysis on the function and tell you whether you'll be able to run it on the device, and if not, why not.

Regarding @vchuravy's question, I think the only way to do something like that would be to have a compiler pass that rewrites the loop into the appropriate invocation of map if it can prove that there are no inter-loop dependencies, etc., which could then potentially run the code on the GPU.

tkelman commented 8 years ago

Some of the AST analysis that ParallelAccelerator.jl does might be highly relevant here.

maleadt commented 8 years ago

@keno At the very least, a low-level kernel launch interface much like what CUDA exposes nowadays, i.e. launch(function, (dims...), args...; stream=0), with args being either bits types or GPU-allocated data structures. It would make it a lot easier to port existing GPU code to Julia.
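To illustrate how ported code might use such an interface: launch follows the signature above, while thread_index is an invented placeholder for the device's thread intrinsics, stubbed here so the sketch runs on the host.

# Host stub; a device build would lower this to threadIdx/blockIdx.
thread_index() = 1

# A kernel is just a Julia function over bits types / device buffers:
function vadd(out, a, b)
    i = thread_index()
    out[i] = a[i] + b[i]
    return nothing
end

# With the proposed interface, existing CUDA code would then port as:
# launch(vadd, (length(d_out),), d_out, d_a, d_b; stream=0)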

Some other codegen/runtime changes come to mind as well.

In a sense, I think compiling for GPUs needs a stripped-down version of the language (at least initially).

timholy commented 8 years ago

It sounds a lot like Reactive's (current) execution model. What happens in a situation where C depends on A and B, and these are running on separate streams? Is there any global optimization, or is everything based on parent relationships?

I agree that GC is a big issue. GPUs want static memory management. CUDArt has semi-succeeded in marrying the two styles, but it's still occasionally flaky.

ChrisRackauckas commented 8 years ago

That execution model looks great. It looks a lot like MATLAB's, and this is one of the things that language did well (except for its multi-GPU support). I think at a minimum the GPUArray should support vectorized functions in the "obvious" way: like the .* in your example, the operators ./, -, .^, and + should do the same, and so should the logical operators .>, .<, etc. Similarly, * and \ should be wrappers around cuBLAS and cuSOLVER, and standard mathematical functions like fft and cos should dispatch to an appropriate version from cuFFT and the CUDA Math Library. Lastly, the rand function on GPUs should use the fast algorithms from the cuRAND library.

This functionality, combined with the execution model @Keno suggests, is pretty much essential for prototyping GPU algorithms in scientific computing. If one could implement something like @devectorize to have Julia take such code and generate the "obvious" CUDA kernel, then one could do most scientific programming for the GPU in Julia at full efficiency. (There's still the issue of garbage collection here, but I think the programmer can handle that explicitly: it's really just the "inner" loops which use GPUs, and one is usually a little more careful with this part anyway.)

However, one thing that should really be incorporated into that execution model is the use of multiple GPUs. Multi-GPU setups have become very common on HPCs, and even the fact that the K80 is a dual GPU makes this relevant very quickly. I think the easiest way of doing this is that any code which creates a GPUArray has to be given a device (or it defaults to device 1), and all vectorized operations first check that the two objects are on the same GPU, or throw an error. I don't know how you'd convert a GPUArray from one device to another, though, or do something like a "DistributedGPUArray" (since GPUs run out of memory fast), but it would be cool to just write vectorized code and use however many GPUs are on the machine.
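A minimal sketch of that same-device rule (DevArray and its fields are invented for illustration):

struct DevArray{T,N}
    data::Array{T,N}  # stands in for a device-resident buffer
    device::Int       # which GPU owns the data; defaults to device 1
end
DevArray(data::Array) = DevArray(data, 1)

# Binary operations first check that both operands live on the same device:
function Base.:+(A::DevArray, B::DevArray)
    A.device == B.device ||
        throw(ArgumentError("operands live on devices $(A.device) and $(B.device)"))
    DevArray(A.data .+ B.data, A.device)
end

A = DevArray(rand(4), 1)
B = DevArray(rand(4), 2)
A + DevArray(rand(4), 1)  # fine: same device
# A + B                   # throws: operands live on devices 1 and 2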

ViralBShah commented 8 years ago

Just wanted to point out the recent progress we've had here:

https://github.com/JuliaComputing/ArrayFire.jl