JuliaGPU / AMDGPU.jl

AMD GPU (ROCm) programming in Julia

Using one single array of pointers for multiGPU AMDGPU computation #663

Open · pedrovalerolara opened this issue 1 month ago

pedrovalerolara commented 1 month ago

Hi folks!

I am working on multi-GPU support for JACC (https://github.com/JuliaORNL/JACC.jl/). For that, I need a single array of pointers that can store pointers to arrays living on different GPUs.

I opened another issue a few days ago (https://github.com/JuliaGPU/AMDGPU.jl/issues/662). Although that helped me understand the problem better, I still cannot run the test code below. The equivalent code runs on CUDA (I include the CUDA version too, in case it is useful).

@pxl-th mentioned the CU_MEMHOSTALLOC_PORTABLE CUDA flag. Can we use that in AMDGPU?

Here are both versions.

AMDGPU:

function multi_scal(N, dev_id, alpha, x)
  # Scale the chunk owned by device `dev_id`: `x` is a vector of device vectors.
  i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
  if i <= N
    @inbounds x[dev_id][i] *= alpha
  end
  return nothing
end

x = ones(600)
alpha = 2.0
ndev = length(AMDGPU.devices())
ret = Vector{Any}(undef, 2)

AMDGPU.device!(AMDGPU.device(1))
s_array = length(x)
s_arrays = ceil(Int, s_array/ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

# Upload one chunk of `x` per device and keep the device-side (converted) arrays.
for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  array_ret[i] = AMDGPU.ROCArray(x[((i-1)*s_arrays)+1:i*s_arrays])
  pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
end

# The vector of device-side pointers is itself uploaded as a ROCArray on device 1.
AMDGPU.device!(AMDGPU.device(1))
amdgpu_pointer_ret = ROCArray(pointer_ret)
ret[1] = amdgpu_pointer_ret
ret[2] = array_ret

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)

# This works
AMDGPU.device!(AMDGPU.device(1))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 1, alpha, ret[1])

# This does not work
AMDGPU.device!(AMDGPU.device(2))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 2, alpha, ret[1])
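
For reference, a minimal sketch of how the outcome of each launch can be checked on the host (AMDGPU.synchronize() and Array are the standard wait/download calls; the 2.0 value simply follows from scaling ones by alpha):

for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  AMDGPU.synchronize()             # wait for (and surface any error from) the launch on this device
  @show Array(array_ret[i])[1:4]   # expected 2.0 where the kernel ran, 1.0 otherwise
end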

CUDA:

function multi_scal(N, dev_id, alpha, x)
  i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  if i <= N
    @inbounds x[dev_id][i] *= alpha
  end
  return nothing
end

x = ones(600)
alpha = 2.0
ndev = length(devices())
ret = Vector{Any}(undef, 2)

device!(0)
s_array = length(x)
s_arrays = ceil(Int, s_array/ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{CuDeviceVector{Float64,CUDA.AS.Global}}(undef, ndev)

for i in 1:ndev
  device!(i-1)
  array_ret[i] = CuArray(x[((i-1)*s_arrays)+1:i*s_arrays])
  pointer_ret[i] = cudaconvert(array_ret[i])
end

device!(0)
cuda_pointer_ret = CuArray(pointer_ret)
ret[1] = cuda_pointer_ret
ret[2] = array_ret

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)

# This works
device!(0)
@cuda threads=threads blocks=blocks multi_scal(s_arrays, 1, alpha, ret[1])

# This works too
device!(1)
@cuda threads=threads blocks=blocks multi_scal(s_arrays, 2, alpha, ret[1])

pxl-th commented 1 month ago

@maleadt how does CUDA make array (cuda_pointer_ret = CuArray(pointer_ret)) accessible on multiple GPUs in this case?

pxl-th commented 1 month ago

@pxl-th mentioned the CU_MEMHOSTALLOC_PORTABLE CUDA flag. Can we use that in AMDGPU?

You can, but at the moment it is not pretty:

# `T`, `N`, and `dims` are the element type, number of dimensions, and size of the array you want.
bytesize = prod(dims) * sizeof(T)
buf = AMDGPU.Runtime.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
amdgpu_pointer_ret = ROCArray{T, N}(AMDGPU.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)

# Copy from CPU array.
copyto!(amdgpu_pointer_ret, pointer_ret)

But this is different from CUDA. What CUDA devices do you use? Maybe they have unified memory?

pedrovalerolara commented 1 month ago

Thank you, @pxl-th!! Regarding NVIDIA systems, I am using two different systems, both with two GPUs: one with A100s and the other with H100s. I am not doing anything special for unified memory. No idea if CUDA.jl is doing something in that regard.

Sorry @pxl-th, but I do not understand your code well. Could you use the variable names from my code to help me see where I have to make the modifications?

I think I must be using an old version of AMDGPU, because I cannot find AMDGPU.pool_free and AMDGPU.Managed. The version of AMDGPU I am using is 0.8. Do I need a more recent version?

pxl-th commented 1 month ago

Yes, you should use AMDGPU 1.0; it has important multi-GPU fixes.
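
For reference, a minimal sketch of upgrading in the active project environment using the standard Pkg API (the exact version spec is just an example):

using Pkg
Pkg.add(name = "AMDGPU", version = "1.0")   # or Pkg.update("AMDGPU") to move to the latest compatible release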

Here's the code. I don't have access to a multi-GPU system at the moment, but at least on one GPU it works:

using AMDGPU

"""
Create a ROCArray that is accessible from different GPUs (a.k.a. portable).
"""
function get_portable_rocarray(x::Array{T, N}) where {T, N}
    dims = size(x)
    bytesize = sizeof(T) * prod(dims)
    buf = AMDGPU.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
    ROCArray{T, N}(AMDGPU.GPUArrays.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
end

function main()
    ndev = 2
    pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

    # Fill `pointer_ret` with pointers here.

    amdgpu_pointer_ret = get_portable_rocarray(pointer_ret)
    @show amdgpu_pointer_ret
    return
end
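
For completeness, a hedged sketch of the part marked "Fill `pointer_ret` with pointers here"; it would go inside main() in place of that placeholder and mirrors the chunked split of a host vector x across ndev devices from the original post:

    x = ones(600)
    s_arrays = cld(length(x), ndev)
    array_ret = Vector{Any}(undef, ndev)
    for i in 1:ndev
        AMDGPU.device!(AMDGPU.device(i))
        # Upload one chunk per device and keep its device-side (converted) form.
        array_ret[i] = ROCArray(x[(i - 1) * s_arrays + 1:i * s_arrays])
        pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
    end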

maleadt commented 4 weeks ago

how does CUDA make array (cuda_pointer_ret = CuArray(pointer_ret)) accessible on multiple GPUs in this case?

Assuming the buffer type used here is device memory (which is the default), CUDA.jl enables P2P access between devices when converting CUDA.Managed (a struct wrapping buffers that keeps track of the owning device and the stream that last accessed the memory) to a pointer: https://github.com/JuliaGPU/CUDA.jl/blob/69043ee42f4c6e08a12662da4d0537b721eeee84/src/memory.jl#L530-L573

Note that this isn't guaranteed to always work; the devices need to be compatible, otherwise P2P isn't supported. In that case the user is responsible for staging through the CPU (with explicit copyto!), or for using unified or host memory, which is automatically accessible from all devices.
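
For illustration, a minimal sketch of the two fallbacks mentioned above, assuming a recent CUDA.jl (the unified keyword of cu is how managed memory is usually requested there; check the documentation of the version you use):

using CUDA

# Fallback 1: stage through the CPU with explicit copies.
device!(0)
d0 = CuArray(rand(Float32, 16))    # memory owned by device 0
host = Array(d0)                   # explicit download to the host
device!(1)
d1 = CuArray(host)                 # a separate copy resident on device 1

# Fallback 2: unified (managed) memory, accessible from every device.
u = cu(rand(Float32, 16); unified = true)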

pedrovalerolara commented 3 weeks ago

Thanks @pxl-th and @maleadt for your comments!!! I am using the code shown below, which incorporates the comments from @pxl-th, on one node of Frontier (8 AMD GPUs per node). The code works well now. Thank you!! However, when I run the kernels I see this:

julia> @roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 3, alpha, ret[1])
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
AMDGPU.Runtime.HIPKernel{typeof(multi_scal), Tuple{Int64, Int64, Float64, AMDGPU.Device.ROCDeviceVector{AMDGPU.Device.ROCDeviceVector{Float64, 1}, 1}}}(multi_scal, AMDGPU.HIP.HIPFunction(Ptr{Nothing} @0x0000000007d7d6a0, AMDGPU.HIP.HIPModule(Ptr{Nothing} @0x00000000082566b0), Symbol[]))

Do you know why? I am using a local Julia installation (1.10.4) and the 1.0 version of AMDGPU.

function get_portable_rocarray(x::Array{T, N}) where {T, N}
    dims = size(x)
    bytesize = sizeof(T) * prod(dims)
    buf = AMDGPU.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
    ROCArray{T, N}(AMDGPU.GPUArrays.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
end

function multi_scal(N, dev_id, alpha, x)
  i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
  if i <= N
    @inbounds x[dev_id][i] *= alpha
  end
  return nothing
end

x = ones(800)
alpha = 2.0
ndev = length(AMDGPU.devices())
ret = Vector{Any}(undef, 2)

AMDGPU.device!(AMDGPU.device(1))
s_array = length(x)
s_arrays = ceil(Int, s_array/ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)

for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  array_ret[i] = AMDGPU.ROCArray(x[((i-1)*s_arrays)+1:i*s_arrays])
  pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
end

AMDGPU.device!(AMDGPU.device(1))

amdgpu_pointer_ret = get_portable_rocarray(pointer_ret)
copyto!(amdgpu_pointer_ret, pointer_ret)

ret[1] = amdgpu_pointer_ret
ret[2] = array_ret

AMDGPU.device!(AMDGPU.device(1))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 1, alpha, ret[1])

AMDGPU.device!(AMDGPU.device(2))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 2, alpha, ret[1])
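
For completeness, a hedged sketch of gathering the chunks back into a single host vector after the launches (only the first two chunks are touched by the two launches above, so the untouched chunks should still hold 1.0):

for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  AMDGPU.synchronize()               # wait for any launch on this device
end
y = vcat([Array(array_ret[i]) for i in 1:ndev]...)
@show y[1:4]         # chunks touched by a launch hold 2.0
@show y[end-3:end]   # untouched chunks still hold 1.0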

pxl-th commented 3 weeks ago

Ah... that's a bug in AMDGPU.jl with setting the features of the compilation target. I'll fix it.