pedrovalerolara opened 1 month ago
@maleadt how does CUDA make the array (cuda_pointer_ret = CuArray(pointer_ret)) accessible on multiple GPUs in this case?
@pxl-th mentioned the CU_MEMHOSTALLOC_PORTABLE CUDA flag. Can we use that in AMDGPU?
You can, but at the moment it is not pretty:
bytesize = prod(dims) * sizeof(T)
buf = AMDGPU.Runtime.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
amdgpu_pointer_ret = ROCArray{T, N}(AMDGPU.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
# Copy from CPU array.
copyto!(amdgpu_pointer_ret, pointer_ret)
But this is different from CUDA. What CUDA devices do you use? Maybe they have unified memory?
Thank you @pxl-th !! Regarding NVIDIA systems, I am using two different systems, both with two GPUs: one with A100s and the other with H100s. I am not doing anything special for unified memory. No idea if CUDA.jl is doing something in that regard.
So sorry @pxl-th, but I do not fully understand your code. Can you use the variable names that I used in my code to help me see where I have to make the modifications?
I think that I must be using an old version of AMDGPU, because I cannot find AMDGPU.pool_free and AMDGPU.Managed. The version of AMDGPU that I am using is 0.8. Do I need to use a more recent version?
Yes, you should use AMDGPU 1.0; it has important multi-GPU fixes.
Here's the code. I don't have access to a multi-GPU system at the moment, but at least on one GPU it works:
using AMDGPU
"""
Create a ROCArray that is accessible from different GPUs (a.k.a. portable).
"""
function get_portable_rocarray(x::Array{T, N}) where {T, N}
dims = size(x)
bytesize = sizeof(T) * prod(dims)
buf = AMDGPU.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
ROCArray{T, N}(AMDGPU.GPUArrays.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
end
function main()
ndev = 2
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)
# Fill `pointer_ret` with pointers here.
amdgpu_pointer_ret = get_portable_rocarray(pointer_ret)
@show amdgpu_pointer_ret
return
end
how does CUDA make the array (cuda_pointer_ret = CuArray(pointer_ret)) accessible on multiple GPUs in this case?
Assuming the buffer type used here is device memory (which is the default), CUDA.jl enables P2P access between devices when converting a CUDA.Managed (a struct wrapping a buffer that keeps track of the owning device and the stream that last accessed the memory) to a pointer: https://github.com/JuliaGPU/CUDA.jl/blob/69043ee42f4c6e08a12662da4d0537b721eeee84/src/memory.jl#L530-L573
Note that this isn't guaranteed to always work: the devices need to be compatible, otherwise P2P isn't supported. In that case the user is responsible for staging through the CPU (with an explicit copyto!), or for using unified or host memory, which is available on all devices automatically.
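The CPU-staging fallback mentioned here can be sketched as follows (a minimal sketch, assuming a system with at least two CUDA devices; the array size is illustrative):

```julia
using CUDA

# Allocate an array on device 0.
a = CUDA.device!(0) do
    CuArray(rand(Float32, 1024))
end

# If P2P access between the two devices is unavailable,
# stage the data through the CPU explicitly:
host = Array(a)          # device 0 -> CPU copy
b = CUDA.device!(1) do
    CuArray(host)        # CPU -> device 1 copy
end
```

The do-block form of device! restores the previously active device afterwards, which keeps the rest of the script unaffected by the temporary device switches.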
Thanks @pxl-th and @maleadt for your comments!!! I am using the code shown below, which incorporates the comments from @pxl-th, on one node of Frontier (8 AMD GPUs per node). The code works well now. Thank you!! However, when I run the kernels I see this:
julia> @roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 3, alpha, ret[1])
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
(the warning above appears 12 times in total)
AMDGPU.Runtime.HIPKernel{typeof(multi_scal), Tuple{Int64, Int64, Float64, AMDGPU.Device.ROCDeviceVector{AMDGPU.Device.ROCDeviceVector{Float64, 1}, 1}}}(multi_scal, AMDGPU.HIP.HIPFunction(Ptr{Nothing} @0x0000000007d7d6a0, AMDGPU.HIP.HIPModule(Ptr{Nothing} @0x00000000082566b0), Symbol[]))
Do you know why? I am using a local Julia installation (1.10.4) and the 1.0 version of AMDGPU.
function get_portable_rocarray(x::Array{T, N}) where {T, N}
dims = size(x)
bytesize = sizeof(T) * prod(dims)
buf = AMDGPU.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
ROCArray{T, N}(AMDGPU.GPUArrays.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
end
function multi_scal(N, dev_id, alpha, x)
i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
if i <= N
@inbounds x[dev_id][i] *= alpha
end
return nothing
end
x = ones(800)
alpha = 2.0
ndev = length(AMDGPU.devices())
ret = Vector{Any}(undef, 2)
AMDGPU.device!(AMDGPU.device(1))
s_array = length(x)
s_arrays = ceil(Int, s_array/ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)
numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)
for i in 1:ndev
AMDGPU.device!(AMDGPU.device(i))
array_ret[i] = AMDGPU.ROCArray(x[((i-1)*s_arrays)+1:i*s_arrays])
pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
end
AMDGPU.device!(AMDGPU.device(1))
amdgpu_pointer_ret = get_portable_rocarray(pointer_ret)
copyto!(amdgpu_pointer_ret, pointer_ret)
ret[1] = amdgpu_pointer_ret
ret[2] = array_ret
AMDGPU.device!(AMDGPU.device(1))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 1, alpha, ret[1])
AMDGPU.device!(AMDGPU.device(2))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 2, alpha, ret[1])
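As a side note, the chunk arithmetic in the script (s_arrays = ceil(Int, s_array/ndev) with slices ((i-1)*s_arrays)+1:i*s_arrays) assumes ndev divides the array length evenly; with an uneven length the last slice would run past the end of x. A small CPU-only sketch of a remainder-safe variant (the sizes are hypothetical, chosen to mirror the example):

```julia
n = 800                  # total length, as in the script above
ndev = 8                 # e.g. one chunk per GPU on a Frontier node
chunk = cld(n, ndev)     # ceil division, same as ceil(Int, n / ndev)

# Clamp the last range so uneven lengths are handled too.
ranges = [((i - 1) * chunk + 1):min(i * chunk, n) for i in 1:ndev]

@assert sum(length, ranges) == n
@assert first(ranges[1]) == 1 && last(ranges[end]) == n
```

Each range can then be used both for slicing the host array and for computing the per-device launch bounds.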
Ah... that's a bug in AMDGPU.jl when setting the features of the compilation target. I'll fix it.
Hi folks!
I am working on multi-GPU support for JACC: https://github.com/JuliaORNL/JACC.jl/ For that, I need a single array that can store pointers to memory on different GPUs.
I opened another issue a few days ago: https://github.com/JuliaGPU/AMDGPU.jl/issues/662 Although that helped me understand the problem better, I still cannot run the test code below. I can run that code on CUDA (I put the CUDA code too, just in case it is useful).
@pxl-th mentioned the CU_MEMHOSTALLOC_PORTABLE CUDA flag. Can we use that in AMDGPU?
Here are the codes: AMDGPU
CUDA