NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

set maxrregcount in outer reduction heuristic #3415

Closed liqiangxl closed 3 days ago

liqiangxl commented 6 days ago

Code changes: In the outer reduction heuristic, set the register count to target 50% occupancy (1024 resident threads per SM).

Why: nvFuser calculates the register count in getMaxRegCount assuming one block per SM. If the compiler follows this register count, we may lose the opportunity to run 2 or more blocks per SM.
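To make the trade-off concrete, here is a rough sketch of the arithmetic. It is not nvFuser's getMaxRegCount; the helper names and the hardware constants (65536 registers per SM, a 255-register per-thread cap, allocation in units of 8 registers per thread) are assumptions for illustration, and the block count ignores shared memory and other residency limits.

```cpp
#include <cstdint>

// Assumed hardware limits (illustrative, not taken from this PR).
constexpr int64_t kRegistersPerSm = 65536;  // 32-bit registers per SM
constexpr int64_t kMaxRegsPerThread = 255;  // per-thread register cap
constexpr int64_t kRegAllocUnit = 8;        // assumed per-thread allocation granularity

// Hypothetical helper: largest per-thread register count that still lets
// `resident_threads` threads be resident on one SM.
constexpr int64_t regBudget(int64_t resident_threads) {
  int64_t rounded =
      (kRegistersPerSm / resident_threads) / kRegAllocUnit * kRegAllocUnit;
  return rounded < kMaxRegsPerThread ? rounded : kMaxRegsPerThread;
}

// Hypothetical helper: blocks per SM that fit by register usage alone.
constexpr int64_t residentBlocks(int64_t regs_per_thread,
                                 int64_t threads_per_block) {
  // Round the per-thread count up to the allocation unit.
  int64_t allocated =
      (regs_per_thread + kRegAllocUnit - 1) / kRegAllocUnit * kRegAllocUnit;
  return kRegistersPerSm / (allocated * threads_per_block);
}
```

Under these assumptions, a 256-thread block budgeted for one block per SM may use up to 255 registers per thread, which leaves room for a single block. Budgeting for 1024 resident threads caps it at 64 registers per thread, which leaves room for four such blocks.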

Another approach is to use __launch_bounds__(max threads per block), but that requires modifying __launch_bounds__ whenever the launch parameters change, which doesn't seem as convenient as directly setting maxrregcount. Tried in PR-3413, then closed.
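A small sketch of why the launch-bounds route is tied to the launch parameters. The struct and helper are hypothetical and are not the code from PR-3413. The sketch assumes the two-argument form, __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor), is used to request the same 1024 resident threads per SM.

```cpp
#include <cstdint>

// Hypothetical holder for the two __launch_bounds__ arguments.
struct LaunchBounds {
  int64_t max_threads_per_block;
  int64_t min_blocks_per_sm;
};

// Hypothetical helper: launch bounds that would request at least
// `target_resident_threads` threads per SM for a given block size.
constexpr LaunchBounds launchBoundsFor(int64_t threads_per_block,
                                       int64_t target_resident_threads) {
  // Ceiling division: enough blocks to reach the target thread count.
  return {threads_per_block,
          (target_resident_threads + threads_per_block - 1) /
              threads_per_block};
}
```

For a 1024-thread target, a 256-thread block needs bounds of (256, 4) while a 512-thread block needs (512, 2), so the annotation has to be regenerated every time the block size changes. A register cap derived only from the target resident thread count stays the same for both.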

Performance changes: see doc.

liqiangxl commented 6 days ago

!test --pybench

jjsjann123 commented 6 days ago

I feel a bigger question is: what's the overall perf impact? Shouldn't we worry about register spills?

liqiangxl commented 6 days ago

!test