Closed liqiangxl closed 1 week ago
!test --pybench
Needs to change __launch_bounds__ based on the launch params. The current main branch only needs to change the compile params, which is more convenient since it doesn't need to modify the kernel string.
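As a rough illustration of the difference between the two approaches, the sketch below contrasts building a compiler option with rewriting the kernel source string. Both helper names are hypothetical and are not nvFuser functions. The sketch also assumes the compile param in question maps to the NVRTC option --maxrregcount, and that the kernel source contains a plain "__global__ void" declaration.

```cpp
#include <cassert>
#include <string>

// Approach A (compile params): the register cap is passed as a compiler
// option, so the kernel source string is left untouched.
// Hypothetical helper; assumes the NVRTC option --maxrregcount=<N>.
std::string maxRegCountOption(int max_registers) {
  assert(max_registers > 0);
  return "--maxrregcount=" + std::to_string(max_registers);
}

// Approach B (launch params): the kernel source string itself is edited to
// carry __launch_bounds__(max threads per block).
// Hypothetical helper; only the first "__global__ void" is annotated.
std::string addLaunchBounds(std::string kernel_src, int max_threads_per_block) {
  assert(max_threads_per_block > 0);
  const std::string marker = "__global__ void";
  const auto pos = kernel_src.find(marker);
  if (pos == std::string::npos) {
    return kernel_src;  // nothing to annotate
  }
  const std::string annotated = marker + " __launch_bounds__(" +
      std::to_string(max_threads_per_block) + ")";
  kernel_src.replace(pos, marker.size(), annotated);
  return kernel_src;
}
```

Approach B has to search and edit the generated source, which is the inconvenience mentioned above. Approach A only appends one entry to the option list.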
code changes:
getMaxRegCount
.__launch_bounds__(max threads per block)
If the heuristics use a static number of threads per block, this is required; otherwise, register usage may exceed the hardware limit.
Why: It is the responsibility of nvrtc to set the register count per thread to ensure there are enough registers to launch the kernel. nvFuser should avoid setting a large value, which may lead to suboptimal PTX/SASS code and lower occupancy. For example, nvFuser derives the register count assuming one block per SM; however, if this is left to nvrtc, we may get two blocks per SM.
The static threads per block need to be passed to the compiler; otherwise, register usage may exceed the hardware limit, since the compiler doesn't know the number of threads in the block.
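The one-block versus two-blocks example can be made concrete with a small sketch of the register budget arithmetic. The helper name is hypothetical and is not nvFuser's getMaxRegCount. The limits of 65536 registers per SM and 255 registers per thread are assumed typical values rather than queried from the device, and register allocation granularity is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed hardware limits; real code should query the device properties.
constexpr int64_t kRegistersPerSm = 65536;       // 32-bit registers per SM
constexpr int64_t kMaxRegistersPerThread = 255;  // per-thread cap

// Hypothetical helper: the largest per-thread register count that still lets
// `blocks_per_sm` blocks of `threads_per_block` threads reside on one SM.
int64_t registerBudgetPerThread(int64_t threads_per_block, int64_t blocks_per_sm) {
  assert(threads_per_block > 0 && blocks_per_sm > 0);
  const int64_t resident_threads = threads_per_block * blocks_per_sm;
  return std::min(kRegistersPerSm / resident_threads, kMaxRegistersPerThread);
}
```

Under these assumptions, a block of 512 threads gets a budget of 128 registers per thread when targeting one block per SM, and 64 when targeting two. Without the thread count, the compiler cannot compute either bound, which is why the static threads per block have to be passed to it.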
Test NVFuserTest.FusionMagicSchedulerSoftmax_CUDA failed. In scheduleAndRun we should add compile params to ke->compile(fusion, runtime_inputs, heuristic_params->lparams); to ensure the correct register count is used. But even without that, I was thinking nvrtc should be smart enough to set a reasonable register count. Unfortunately, it doesn't do that in this case because it doesn't know the number of threads per block.