JuliaORNL / JACC.jl

CPU/GPU parallel performance portable layer in Julia via functions as arguments
MIT License

Need more accurate threads per block #58

Open PhilipFackler opened 7 months ago

PhilipFackler commented 7 months ago

I hit a CUDA "too many resources" error and found that it happens because my kernel requires a lot of registers. I found the following answer helpful, but it uses the deprecated CUDAnative package. Calling maxthreads on the kernel object returned by cufunction takes the number of registers the kernel needs into account, unlike the device-wide limit on threads per block. Based on that example, here is what I came up with for the one-dimensional JACC.parallel_for:

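function JACC.parallel_for(N::I, f::F, x...) where {I<:Integer,F<:Function}
  # Convert the host-side arguments to their device-side representations
  parallel_args = (f, x...)
  parallel_kargs = cudaconvert.(parallel_args)
  parallel_tt = Tuple{Core.Typeof.(parallel_kargs)...}
  # Compile the kernel without launching it so it can be queried
  parallel_kernel = cufunction(_parallel_for_cuda, parallel_tt)
  # Per-kernel limit, which accounts for the kernel's register usage
  maxPossibleThreads = CUDA.maxthreads(parallel_kernel)
  threads = min(N, maxPossibleThreads)
  # Integer ceiling division; avoids the floating-point round trip
  blocks = cld(N, threads)
  parallel_kernel(parallel_kargs...; threads=threads, blocks=blocks)
end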

This works, although it probably needs more exploration.
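
To make the effect of the per-kernel limit concrete, here is a minimal CPU-only sketch of the same threads/blocks arithmetic. The helper name launch_config and the limits 1024 and 640 are assumptions made up for illustration; they are not part of JACC.jl or CUDA.jl, and the actual value reported by CUDA.maxthreads depends on the kernel and the device.

```julia
# Hypothetical helper (not part of JACC.jl or CUDA.jl): choose a 1D launch
# configuration from the iteration count and a threads-per-block limit.
function launch_config(N::Integer, max_threads::Integer)
    N > 0 || throw(ArgumentError("N must be positive"))
    threads = min(N, max_threads)   # never launch more threads than iterations
    blocks = cld(N, threads)        # ceiling division so every index is covered
    return threads, blocks
end

device_limit = 1024  # assumed device-wide maximum threads per block
kernel_limit = 640   # assumed per-kernel limit for a register-heavy kernel

println(launch_config(1_000_000, device_limit))  # (1024, 977)
println(launch_config(1_000_000, kernel_limit))  # (640, 1563)
println(launch_config(100, kernel_limit))        # (100, 1)
```

With the lower per-kernel limit, the same N is covered by more, smaller blocks. That is the trade-off the version above makes in order to avoid the resource error.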