Code Changes
In the outer reduction heuristic, set the register count to achieve 50% occupancy with 1024 threads per SM.
Why
nvFuser computes the register count in `getMaxRegCount` assuming one block per SM. If the compiler follows this register count, we may lose the opportunity to run 2 or more blocks per SM.
Another approach is to use `__launch_bounds__(max threads per block)`, but that requires modifying `__launch_bounds__` whenever the launch parameters change, which seems less convenient than setting maxrregcount directly. This was tried in PR-3413, which was then closed.
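For reviewers, a minimal sketch of the arithmetic behind the register budget. This is not nvFuser's `getMaxRegCount`: the helper name is hypothetical, and the constants (64K 32-bit registers per SM, a 255-register per-thread cap, rounding down to a multiple of 8) are assumptions that match many recent NVIDIA GPUs but should be checked against the target architecture.

```cpp
#include <algorithm>
#include <cstdint>

// Assumption: 64K 32-bit registers per SM.
constexpr int64_t kRegistersPerSm = 65536;
// Assumption: hardware limit on registers per thread.
constexpr int64_t kMaxRegsPerThread = 255;
// Assumption: round the per-thread budget down to a multiple of 8.
constexpr int64_t kRegGranularity = 8;

// Hypothetical helper: largest per-thread register count that still lets
// `resident_threads_per_sm` threads be resident on one SM at the same time.
int64_t regsPerThreadForResidentThreads(int64_t resident_threads_per_sm) {
  int64_t regs = kRegistersPerSm / resident_threads_per_sm;
  regs = regs / kRegGranularity * kRegGranularity;
  return std::min(regs, kMaxRegsPerThread);
}
```

Under these assumptions, budgeting for a single 512-thread block per SM allows 128 registers per thread, while targeting 1024 resident threads per SM limits the budget to 64, which leaves room for two 512-thread blocks.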
Performance changes: see doc.