JuliaORNL / JACC.jl

CPU/GPU parallel performance portable layer in Julia via functions as arguments

Adding target option #62

Open michel2323 opened 6 months ago

michel2323 commented 6 months ago

This adds a target option to the parallel function calls. For CUDA:

JACC.parallel_for(CUDABackend(), N, axpy, alpha, x_device_JACC, y_device_JACC)

The GPU packages provide these backends. JACC then defines ThreadsBackend() in addition to those.

Doing it this way should resolve the precompilation error, while also resolving https://github.com/JuliaORNL/JACC.jl/issues/56 . In addition, there is no need to set preferences anymore, and the various backends can be used concurrently in the same code. There is also no need for a JACC.Array type. This tries to imitate the target offload pragma of OpenMP.
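As a rough sketch of the concurrent-backend point (the variable and function names below are mine, not part of this PR), mixing the host and a CUDA backend in one script could look like:

using JACC, CUDA, Adapt

function axpy(i, alpha, x, y)
    if i <= length(x)
        @inbounds x[i] += alpha * y[i]
    end
end

N = 1_000
alpha = 2.5f0
x_host = rand(Float32, N); y_host = rand(Float32, N)
x_dev = adapt(CUDABackend(), x_host); y_dev = adapt(CUDABackend(), y_host)

JACC.parallel_for(ThreadsBackend(), N, axpy, alpha, x_host, y_host)  # CPU threads
JACC.parallel_for(CUDABackend(), N, axpy, alpha, x_dev, y_dev)       # NVIDIA GPU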

@PhilipFackler Let me know if there are any further issues with this solution.

Edit: These backends are also used by KernelAbstractions (except ThreadsBackend(), of course), so it would now be easy to write, for example, some GPU kernels in KA that don't require backend-specific functionality.
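For instance, a portable KA kernel driven by the same backend object could look roughly like this (a sketch; the kernel name and sizes are just illustrative):

using KernelAbstractions

@kernel function ka_axpy!(alpha, x, @Const(y))
    i = @index(Global)
    @inbounds x[i] += alpha * y[i]
end

backend = CPU()              # or CUDABackend(), ROCBackend(), ...
x = rand(Float32, 1_000)     # on a GPU backend these would be device arrays
y = rand(Float32, 1_000)

ka_axpy!(backend)(2.5f0, x, y; ndrange = length(x))
KernelAbstractions.synchronize(backend)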

williamfgc commented 6 months ago

@michel2323 thanks for adding this. I think we need to discuss offline as the changes remove JACC's public API portability across vendors for the same code. Am I seeing this right?

michel2323 commented 6 months ago

@michel2323 thanks for adding this. I think we need to discuss offline as the changes remove JACC's public API portability across vendors for the same code. Am I seeing this right?

I wouldn't say so? In what sense? At some point you have to pick a backend, but the same is true for OpenMP. In Julia you can do this at runtime.

michel2323 commented 6 months ago

If your code only uses one backend, say CUDA, you could have a setup.jl where the user picks the backend. Or you could load all backend packages (CUDA.jl, AMDGPU.jl, ...) and check which one is functional() (see the tests), as sketched below. I think this is great in case you have a mix of AMD and NVIDIA GPUs in one system, for example. The code with the parallel calls is the same across all vendors.
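For example, something along these lines (assuming AMDGPU's KernelAbstractions backend is called ROCBackend()):

using JACC, CUDA, AMDGPU

backend = CUDA.functional()   ? CUDABackend() :
          AMDGPU.functional() ? ROCBackend()  :
                                ThreadsBackend()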

michel2323 commented 6 months ago

Ah, maybe I see what you mean: array types like CuArray are vendor specific. But Julia already provides a wonderful solution for that with the Adapt package, which all backends support.

using Adapt

x = zeros(10)
dx = adapt(backend, x)  # a CuArray when backend = CUDABackend()

So in the case where backend = CUDABackend(), dx will be of type CuArray, and you never have to (nor should) use the vendor-specific types directly.

For @PhilipFackler this would also make things easier if there's a struct with mixed host and device types: he would only have to define an Adapt.adapt(backend, mystruct) method for it.
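A minimal sketch of what that could look like using Adapt's usual extension point, adapt_structure (the struct here is hypothetical):

using Adapt

# Hypothetical struct mixing plain host data and an array that should live on the device
struct Mesh{T}
    ncells::Int   # host metadata, copied as-is
    coords::T     # array field that gets adapted to the chosen backend
end

# Rebuild the struct with its array field adapted, so adapt(backend, mesh) "just works"
Adapt.adapt_structure(to, m::Mesh) = Mesh(m.ncells, Adapt.adapt(to, m.coords))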

williamfgc commented 6 months ago

I think this is great in case you have a mix of AMD and NVIDIA GPUs on one system

This is mostly a corner case that very rarely comes up, so we should focus on portable code across different vendors. I agree it's a nice-to-have, but enforcing a specific back end in the public API should be optional (maybe via a macro?) for corner cases, not the rule.

The back-end selection follows Preferences.jl, just like MPIPreferences.jl, so user code calling JACC (like the code in the tests) doesn't need to be changed from parallel_for(BackendX, ...) to parallel_for(BackendY, ...), especially in code with several calls to parallel_for. In fact, users only need to set LocalPreferences and add an "import XBackend". We can discuss offline.
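For reference, a sketch of that flow (assuming the preference key JACC reads is called "backend"; check JACCPreferences for the actual name):

using Preferences, JACC

# Writes the choice to LocalPreferences.toml; the user code itself stays unchanged
set_preferences!(JACC, "backend" => "cuda"; force = true)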

michel2323 commented 6 months ago

The argument would be a backend variable. If you want, you can make it a global variable, or have a default based on which backend is functional. I don't think it's such a corner case, since one has at least a host and a device backend available, and I doubt one wants to run everything on a device.

williamfgc commented 6 months ago

I doubt one wants to run everything on a device

For those cases, the user should rely on regular Julia Arrays and the CPU (host) if it's not worth porting; JACC is very much targeted at performance-portable pieces of code.

The argument would be a variable backend. If you want, you can make it a global variable or have a default based on what backend is functional.

That's what JACCPreferences sets, but via LocalPreferences, see this line. The less vendor/system information is exposed to the target users (domain scientists), the better.

michel2323 commented 6 months ago

Let me add an example:

# Code in a setup.jl, or run by the user before the application code
using JACC   # provides ThreadsBackend()
using CUDA
using Adapt

if CUDA.functional()
    backend = CUDABackend()
else
    backend = ThreadsBackend()
end

# Application code using JACC, which is the same across all vendors

using JACC

function axpy(i, alpha, x, y)
    if i <= length(x)
        @inbounds x[i] += alpha * y[i]
    end
end

N = 1_000_000
alpha = 2.5
x = round.(rand(Float32, N) * 100)
y = round.(rand(Float32, N) * 100)

x = adapt(backend, x)
y = adapt(backend, y)

for i in 1:11
    @time JACC.parallel_for(backend, N, axpy, alpha, x, y)
end

# Copy back to the host
x = adapt(ThreadsBackend(), x)
y = adapt(ThreadsBackend(), y)

So the difference is whether to set preferences or to set the backend in a setup.jl. The preferences solution breaks precompilation with the current API, see https://github.com/JuliaORNL/JACC.jl/issues/53 . I don't know how else to resolve that.

michel2323 commented 6 months ago

The difference between MPI and the GPU backends is that MPI has the same API across all implementations and the same array types are passed in. For the GPUs that's different.

williamfgc commented 6 months ago

The difference between MPI and the GPU backends is that MPI has the same API across all implementations and the same array types are passed in. For the GPUs that's different.

Yeah, that's the goal of JACC. Users should not interact with back ends (at most minimally, like it's done today with Preferences). "JACC-aware" MPI would be a noble goal, though.

michel2323 commented 6 months ago

Another stab at it. This defines a default_backend.

using JACC
println_default_backend()   # prints "Default backend is ThreadsBackend()"
using CUDA
println_default_backend()   # prints "Default backend is CUDABackend()"

And then there are parallel methods that pass this down. Of course, if multiple GPU packages are loaded by the user, this will pick whatever extension was compiled last.
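Roughly, the mechanism could look like this (a sketch with assumed internal names such as DEFAULT_BACKEND, not the exact code in this PR):

module JACCSketch

abstract type Backend end
struct ThreadsBackend <: Backend end

# Global default; each GPU weak-dependency extension overwrites it when loaded,
# e.g. a CUDA extension would set DEFAULT_BACKEND[] = CUDABackend()
const DEFAULT_BACKEND = Ref{Backend}(ThreadsBackend())

default_backend() = DEFAULT_BACKEND[]
println_default_backend() = println("Default backend is ", default_backend())

# Methods without an explicit backend forward to the default one
parallel_for(N::Integer, f, args...) = parallel_for(default_backend(), N, f, args...)

end # module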

Sorry, I really don't know how else to resolve the precompilation issue with Preferences. You can't redefine a method with the same arguments.

michel2323 commented 6 months ago

And now with Preferences support too. So the breaking change is that JACC.Array is gone. That is still the difficult bit: you cannot dispatch on JACC.Array with all backends and have precompilation working.

williamfgc commented 6 months ago

@michel2323 thanks, see discussion in #53 . I am asking @PhilipFackler how to reproduce the error as it's not showing in the current CI. I'd rather keep the public API as simple as possible since back ends can be handled internally and weak dependencies should provide the desired separation.

williamfgc commented 6 months ago

Ideally, users should not deal with any detail in the code other than memory allocation, parallel_for, and parallel_reduce. Otherwise there is little advantage in using JACC if the programming model is not that simple (even adapt is too complex for end users). Today, it works like this:

# Using CUDA triggers the JACCCUDA weak dependency and must match LocalPreferences.toml
using CUDA # the code should work just fine on CPU without this line 
using JACC # I don't know if there is a good way to just import a back end here (e.g. CUDA, AMDGPU, etc.)

function axpy(i, alpha, x, y)
    if i <= length(x)
        @inbounds x[i] += alpha * y[i]
    end
end

N = 1_000_000 # problem size
x = JACC.Array(round.(rand(Float32, N) * 100))
y = JACC.Array(round.(rand(Float32, N) * 100))
alpha = 2.5

for i in 1:11
    @time JACC.parallel_for(N, axpy, alpha, x, y)
end

# Copy to host...perhaps implement JACC.to_host(x) to avoid deep copies on CPU host and device
x_h = Array(x)
y_h = Array(y)