JuliaORNL / JACC.jl

CPU/GPU parallel performance portable layer in Julia via functions as arguments
MIT License

Improve threads and blocks tuning for 2D parallel_for (CUDA) #100

Open PhilipFackler opened 3 months ago

PhilipFackler commented 3 months ago

Similar to #76 but for 2D. This only applies to the CUDA version.
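For readers unfamiliar with the tuning problem, here is a minimal sketch of how a 2D launch configuration can be chosen. It is not the code in this PR. The name `launch_config_2d` and the constant `MAX_THREADS_PER_BLOCK` are hypothetical, the 1024-thread limit is an assumption that real code should query from the device, and Python is used only to show the arithmetic (JACC itself is Julia).

```python
# Assumed per-block thread limit; real code should query the device instead.
MAX_THREADS_PER_BLOCK = 1024


def launch_config_2d(m, n, max_threads=MAX_THREADS_PER_BLOCK):
    """Return ((tx, ty), (bx, by)) covering an m x n index space.

    Hypothetical helper: fills the x dimension first, then gives y
    whatever thread budget is left, so tx * ty never exceeds max_threads.
    Assumes m >= 1 and n >= 1.
    """
    tx = min(m, max_threads)
    ty = min(n, max_threads // tx)
    # Ceiling division: enough blocks to cover every index in each dimension.
    bx = -(-m // tx)
    by = -(-n // ty)
    return (tx, ty), (bx, by)


if __name__ == "__main__":
    print(launch_config_2d(100, 100))  # ((100, 10), (1, 10))
```

The point of the sketch is that the thread counts in the two dimensions cannot be picked independently: their product is bounded, so a large extent in one dimension shrinks the budget for the other.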

williamfgc commented 3 months ago

@PhilipFackler does this work for your use case?

williamfgc commented 3 months ago

Test this please

PhilipFackler commented 3 months ago

@williamfgc It worked for the smallish test case I was using for development (one that caused the failure in the first place). I still want to test it out on iguazu with a bigger case.

williamfgc commented 3 months ago

@PhilipFackler thanks, let me know when this is ready to merge.

PhilipFackler commented 3 months ago

@williamfgc I added functors to change how i and j are obtained for the kernel function. This solves the problem for my case. However, it would be nice to implement a BlockIndexer that uses sub-ranges when the problem size would cause the number of blocks to exceed the maximum.
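The two ideas in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the functors added in the PR: `DirectIndexer` and `sub_ranges` are hypothetical names, indices are 0-based (CUDA.jl is 1-based), and the 65535 grid limit for the y dimension is treated as an assumption that real code should query from the device.

```python
# Assumed CUDA grid limit in the y dimension; query the device in real code.
MAX_BLOCKS_Y = 65535


class DirectIndexer:
    """Hypothetical functor: maps (block, block_dim, thread) to one (i, j)."""

    def __call__(self, block, block_dim, thread):
        i = block[0] * block_dim[0] + thread[0]
        j = block[1] * block_dim[1] + thread[1]
        return i, j


def sub_ranges(n, threads, max_blocks=MAX_BLOCKS_Y):
    """Split range(n) into (start, stop) chunks.

    Each chunk holds at most threads * max_blocks indices, so a launch
    over one chunk needs no more than max_blocks blocks in that dimension.
    """
    chunk = threads * max_blocks
    return [(start, min(start + chunk, n)) for start in range(0, n, chunk)]


if __name__ == "__main__":
    idx = DirectIndexer()
    print(idx((2, 3), (16, 16), (5, 7)))          # (37, 55)
    print(sub_ranges(10, threads=2, max_blocks=2))  # [(0, 4), (4, 8), (8, 10)]
```

Passing the indexer in as a callable keeps the kernel body unchanged while the mapping from launch coordinates to (i, j) varies. A sub-range variant would add each chunk's `start` to the computed index and launch once per chunk.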

PhilipFackler commented 3 months ago

@williamfgc I believe this is ready to merge. You'll want to request all the tests again.

PhilipFackler commented 1 month ago

@williamfgc can you merge this?