I am trying to set up a dynamic kernel wherein a KA (KernelAbstractions.jl) kernel launches a CUDA kernel. The final objective is to have dynamic parallelism using only KernelAbstractions. This is an MWE comparing launching the parent kernel with CUDA versus with KA.

The child kernel:
```julia
function child!(a)
    i = threadIdx().x
    @inbounds a[i] = i
    return nothing
end
```
CUDA implementation (runs):
```julia
function parent!(a)
    @cuda dynamic=true threads=10 blocks=1 child!(a)
    return nothing
end

a = CuArray(zeros(10))
kernel! = @cuda launch=false maxthreads=10 always_inline=true parent!(a)
kernel!(a; threads=1, blocks=1)
```
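As a sanity check (a hedged sketch, assuming the default `Float64` element type from `zeros(10)`), the result of the CUDA version can be verified on the host:

```julia
# The child writes a[i] = i for i = 1:10 across its 10 threads, so the
# array should hold 1.0:10.0 after the launch. Array(a) copies
# device -> host and synchronizes implicitly.
@assert Array(a) == collect(1.0:10.0)
```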
KA implementation:
```julia
@kernel function parent!(a)
    @cuda dynamic=true threads=10 blocks=1 child!(a)
end

a = CuArray(zeros(10))
kernel! = parent!(CUDA.CUDABackend(), 1, 1)
kernel!(a)
```
This returns:
```
JIT session error: Symbols not found: [ cudaGetErrorString ]
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_throw_device_cuerror_3299 }) }
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_#_#14_3295 }) }
JIT session error: Symbols not found: [ cudaGetErrorString ]
JIT session error: Failed to materialize symbols: { (JuliaOJIT, { julia_throw_device_cuerror_3306 }) }
ERROR: a CUDA error was thrown during kernel execution: invalid configuration argument (code 9, cudaErrorInvalidConfiguration)
ERROR: a exception was thrown during kernel execution.
Stacktrace:
  [1] throw_device_cuerror at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:20
  [2] #launch#950 at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:27
  [3] launch at /home/ssilvest/.julia/packages/CUDA/35NC6/src/device/intrinsics/dynamic_parallelism.jl:65
  [4] #868 at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:136
  [5] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:95
  [6] macro expansion at ./none:0
  [7] convert_arguments at ./none:0
  [8] #cudacall#867 at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:135
  [9] cudacall at /home/ssilvest/.julia/packages/CUDA/35NC6/lib/cudadrv/execution.jl:134
 [10] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:219
 [11] macro expansion at ./none:0
 [12] #call#1045 at ./none:0
 [13] call at ./none:0
 [14] #_#1061 at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:371
 [15] DeviceKernel at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:371
 [16] macro expansion at /home/ssilvest/.julia/packages/CUDA/35NC6/src/compiler/execution.jl:88
 [17] macro expansion at /home/ssilvest/test.jl:46
 [18] gpu_parent! at /home/ssilvest/.julia/packages/KernelAbstractions/WoCk1/src/macros.jl:90
 [19] gpu_parent! at ./none:0
```
Is this expected? I guess it might be a problem of KA setting `maxthreads=1` in the kernel call.
In my experience dynamic parallelism doesn't have the best performance, and of course we would need to figure out what it means for at least one other backend.
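As a hedged alternative sketch (not a fix for the device-side error above, and `child_ka!` is a name introduced here for illustration): the two-level launch can be flattened on the host, running the child as an ordinary KA kernel and avoiding dynamic parallelism entirely.

```julia
using CUDA, KernelAbstractions

# Host-side equivalent of the child: each work-item writes its
# global index into the array.
@kernel function child_ka!(a)
    i = @index(Global)
    @inbounds a[i] = i
end

a = CuArray(zeros(10))
backend = CUDA.CUDABackend()
kernel! = child_ka!(backend, 10)            # workgroup size of 10
kernel!(a, ndrange = 10)                    # one workgroup of 10 threads
KernelAbstractions.synchronize(backend)
```

This trades the device-side launch for a second host-side one, which is often acceptable when the child's launch configuration is known before the parent runs.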