JuliaGPU / KernelAbstractions.jl

Heterogeneous programming in Julia
MIT License

Dynamic parallelism #442

Open simone-silvestri opened 6 months ago

simone-silvestri commented 6 months ago

I am trying to set up a dynamic kernel in which a KA kernel launches a CUDA kernel. The final objective would be to have dynamic parallelism using only KernelAbstractions. This is an MWE comparing launching the parent kernel with CUDA versus with KA.

The child kernel:

using CUDA
using KernelAbstractions

function child!(a)
    i = threadIdx().x    # device-side thread index within the block
    @inbounds a[i] = i
    return nothing
end

CUDA implementation (runs)

function parent!(a)
    @cuda dynamic=true threads=10 blocks=1 child!(a)
    return nothing
end

a = CuArray(zeros(10))

kernel! = @cuda launch=false maxthreads=10 always_inline=true parent!(a)

kernel!(a; threads=1, blocks=1)
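
(Not part of the original snippet, but a quick way to confirm that the dynamic child launch actually wrote the array:)

Array(a)   # expected to read back 1.0, 2.0, ..., 10.0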

KA implementation

@kernel function parent!(a)
    # device-side launch of the child kernel from inside a KA kernel
    @cuda dynamic=true threads=10 blocks=1 child!(a)
end

a = CuArray(zeros(10))

kernel! = parent!(CUDA.CUDABackend(), 1, 1)

kernel!(a)

returns

JIT session error: Symbols not found: [ cudaGetErrorString ]
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_throw_device_cuerror_3299 }) }
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_#_#14_3295 }) }
JIT session error: Symbols not found: [ cudaGetErrorString ]
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_throw_device_cuerror_3306 }) }
ERROR: a CUDA error was thrown during kernel execution: invalid configuration argument (code 9, cudaErrorInvalidConfiguration)
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] throw_device_cuerror at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:20
 [2] #launch#950 at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:27
 [3] launch at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:65
 [4] #868 at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:136
 [5] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:95
 [6] macro expansion at ./none:0
 [7] convert_arguments at ./none:0
 [8] #cudacall#867 at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:135
 [9] cudacall at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:134
 [10] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:219
 [11] macro expansion at ./none:0
 [12] #call#1045 at ./none:0
 [13] call at ./none:0
 [14] #_#1061 at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:371
 [15] DeviceKernel at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:371
 [16] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:88
 [17] macro expansion at /home/ssilvest/test.jl:46
 [18] gpu_parent! at /home/ssilvest/.julia/packages/KernelAbstractions/WoCk1/src/macros.jl:90
 [19] gpu_parent! at ./none:0

Is this expected? I guess it might be a problem of KA setting maxthreads=1 in the kernel call.
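
One way to probe that guess (an untested sketch; it assumes KA only constrains maxthreads when the kernel is instantiated with a static workgroup size, which I have not verified, and parent_dynamic! is just a renamed copy of the parent above):

@kernel function parent_dynamic!(a)
    @cuda dynamic=true threads=10 blocks=1 child!(a)
end

kernel! = parent_dynamic!(CUDA.CUDABackend())   # no static workgroupsize/ndrange
kernel!(a; ndrange=1, workgroupsize=1)
KernelAbstractions.synchronize(CUDA.CUDABackend())

If the configuration error disappears, that would point at the maxthreads heuristic rather than at the device-side launch itself.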

vchuravy commented 6 months ago

Slightly confusing, so not expected.

In my experience, dynamic parallelism doesn't have the best performance, and of course we will need to figure out what it means for at least one other backend.
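
For reference, the pattern that avoids dynamic parallelism entirely is to launch the child-sized work from the host as a flat KA kernel, which also stays backend-agnostic. A rough sketch (not code from this issue):

@kernel function flat_child!(a)
    i = @index(Global)       # global work-item index replaces the device-side launch
    @inbounds a[i] = i
end

a = CuArray(zeros(10))
flat_child!(CUDA.CUDABackend(), 10)(a; ndrange=10)
KernelAbstractions.synchronize(CUDA.CUDABackend())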