JuliaGPU / KernelAbstractions.jl

Heterogeneous programming in Julia

Auto-tuning workgroupsize when localmem consumption depends on it #215

Open tkf opened 3 years ago

tkf commented 3 years ago

Does KernelAbstractions.jl support auto-setting the workgroupsize when the kernel's local memory size depends on the groupsize? For example, CUDA.launch_configuration takes a shmem callback that maps a number of threads to the amount of shared memory used; this is how mapreduce is implemented in CUDA.jl. Since the shmem argument of CUDA.launch_configuration is not used in Kernel{CUDADevice}, I guess it's not implemented yet? Is it related to #19?
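For reference, a minimal sketch of that CUDA.jl pattern (using current CUDA.jl names; the toy copy_via_shmem! kernel and sizes are illustrative, not CUDA.jl's actual mapreduce):

```julia
using CUDA

# Toy kernel whose dynamic shared memory usage grows with the block size.
function copy_via_shmem!(y, x)
    shared = CuDynamicSharedArray(Float32, blockDim().x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        shared[threadIdx().x] = x[i]
        y[i] = shared[threadIdx().x]
    end
    return nothing
end

x = CUDA.rand(Float32, 10_000)
y = similar(x)

kernel = @cuda launch=false copy_via_shmem!(y, x)
# `shmem` maps a candidate thread count to the bytes of dynamic
# shared memory the kernel would need at that size.
shmem(threads) = threads * sizeof(Float32)
config = launch_configuration(kernel.fun; shmem)
threads = min(length(x), config.threads)
blocks = cld(length(x), threads)
kernel(y, x; threads, blocks, shmem = shmem(threads))
```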

vchuravy commented 3 years ago

This is #11. KA doesn't support dynamic shared memory.
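For context, KA's existing @localmem is statically sized; a minimal sketch, assuming a fixed workgroup size of 64 (the kernel itself is illustrative):

```julia
using KernelAbstractions

@kernel function stage_through_localmem!(y, @Const(x))
    li = @index(Local, Linear)
    gi = @index(Global, Linear)
    # The local-memory size is fixed when the kernel is specialized;
    # there is no launch-time (dynamic) equivalent, which is what #11 tracks.
    tile = @localmem Float32 (64,)
    tile[li] = x[gi]
    @synchronize
    y[gi] = tile[li]
end
```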

tkf commented 3 years ago

Does #11 have auto-tuning? I skimmed the code but couldn't find any. Or is it planned but not implemented?

vchuravy commented 3 years ago

No, #11 was started before we added auto-tuning, and it stalled since no one had a clear need for it.

tkf commented 3 years ago

oh, that sounds like I need to give it a shot if I want it :joy:

I'm still not clear on how to implement auto-tuning with #11, though. If I write @dynamic_localmem T (workgroupsize) -> expression_with(T, workgroupsize), I also need a way to compute T from the arguments to the kernel, which can be arbitrarily complex. Since Cassette operates on untyped IR, isn't it impossible to get T given the kernel argument types? Doing this at the macro level is even more hopeless. Also, what about a @dynamic_localmem hidden behind an inlinable function call?

If these concerns are legitimate, maybe we still need an explicit shmem-callback-like approach?
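For concreteness, a hypothetical sketch of what such an API could look like; nothing below exists in KA, and autotune_workgroupsize and the localmem keyword are made-up names mirroring CUDA.launch_configuration's shmem callback:

```julia
# Hypothetical, NOT existing KernelAbstractions API.
# The caller states the per-workgroup local-memory need explicitly,
# so nothing has to be recovered from untyped IR.
localmem_bytes(::Type{T}, groupsize) where {T} = groupsize * sizeof(T)

# wgsize = autotune_workgroupsize(kernel;            # made-up helper
#              localmem = gs -> localmem_bytes(Float32, gs))
# kernel = mykernel(CUDADevice(), wgsize)            # then launch as usual
```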

tkf commented 3 years ago

I'm particularly interested in this use case combined with pre-launch workgroupsize auto-tuning (#216).

bjarthur commented 2 months ago

> before we added auto-tuning...

Is auto-tuning documented? If so, I can't find it.

vchuravy commented 2 months ago

When workgroupsize = nothing, the backend is free to pick a size. Most of the GPU backends have a way to ask for an appropriate size for a compiled kernel (née auto-tuning), and the CPU backend picks 1024.
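A minimal sketch of that default in action (using the current KernelAbstractions API; the kernel itself is illustrative):

```julia
using KernelAbstractions

@kernel function scale!(y, @Const(x), a)
    i = @index(Global)
    @inbounds y[i] = a * x[i]
end

backend = CPU()                    # or CUDABackend(), ROCBackend(), ...
x = rand(Float32, 1024)
y = similar(x)

# No workgroupsize is passed, so it defaults to `nothing` and the
# backend picks one (via an occupancy query on GPUs, 1024 on the CPU).
kernel = scale!(backend)
kernel(y, x, 2f0; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```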