avoid calc reg count from block size in getMaxRegCount

liqiangxl commented 1 week ago

code changes:

remove code to calculate register count based on one block per sm in getMaxRegCount.
Add kernel launch bound __launch_bounds__(max threads per block) if heuristics uses static threads per block, this is required, otherwise, register usage may exceed hardware limit

why: It is the responsibility of nvrct to set the register count per thread to ensure there are enough registers to launch the kernel. nvFuser should avoid setting a large value which may lead to suboptimal ptx/sass code and lower occupancy. For example, nvFuser derives register count based on one block per sm, however, if leave this to nvrtc, we may get two blocks per sm.

Needs to pass the static threads per block to compiler, otherwise, register usage may exceed hardware limit since it doesn't know number of threads in this block.

Test NVFuserTest.FusionMagicSchedulerSoftmax_CUDA failed

Compile Parameters: index_type = int, maxrregcount = 64, enable_magic_zero = 1, enable_ptxas_verbose = 0

====================================

ptxas info    : 3 bytes gmem, 24 bytes cmem[4]
ptxas info    : Compiling entry function '_ZN71_GLOBAL__N__00000000_32___tmp_kernel_none_f0_c0_r0_g0_cu_fa818d50_5274224nvfuser_none_f0_c0_r0_g0ENS_6TensorIfLi2ELi2EEES1_' for 'sm_80'
ptxas info    : Function properties for _ZN71_GLOBAL__N__00000000_32___tmp_kernel_none_f0_c0_r0_g0_cu_fa818d50_5274224nvfuser_none_f0_c0_r0_g0ENS_6TensorIfLi2ELi2EEES1_
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 82 registers, used 1 barriers, 16 bytes smem, 432 bytes cmem[0], 8 bytes cmem[2]

Launch Parameters: BlockDim.x = 1024, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = 108, GridDim.y = -1, GridDim.z = -1, Smem Size = 4096

In scheduleAndRun we should add compile params to ke->compile(fusion, runtime_inputs, heuristic_params->lparams); to ensure the correct register count is used. But even without that, I was thinking nvrtc should be smart enough to set a resonable register count, unfortunatelly, it doesn't do that in this case because it doesn't know the info of threads per block.

liqiangxl commented 1 week ago

!test --pybench

liqiangxl commented 1 week ago

Needs to change __launch_bounds__ based on launch params, current main branch just needs to change compile params, which is more convenient since it does't need to modify the string of the kernel.

NVIDIA / Fuser

avoid calc reg count from block size in getMaxRegCount #3413