Tracks performance issues related to the inner reduction scheduler

Identified some inner reduction performance issues:

(1) Warp reduction requires bdimx * inner_reduction_unroll_factor != inner_most_dimension_numel, which disables warp reduction for cases with an inner dim of, e.g., 4096, since 4096 = 512 * 8. Should remove this restriction and use warp reduction for these cases. Addressed in PR #3288.
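To make the effect of the restriction in (1) concrete, here is a minimal sketch that only mirrors the condition as written above. The function name and signature are hypothetical and are not taken from the nvFuser code base.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an nvFuser API): mirrors the restriction exactly
// as stated in item (1). Returns true when warp reduction would be allowed.
bool warpReductionAllowedWithRestriction(
    int64_t bdimx,
    int64_t inner_reduction_unroll_factor,
    int64_t inner_most_dimension_numel) {
  assert(bdimx > 0 && inner_reduction_unroll_factor > 0);
  assert(inner_most_dimension_numel > 0);
  return bdimx * inner_reduction_unroll_factor != inner_most_dimension_numel;
}
```

With bdimx = 512 and an unroll factor of 8, an inner dim of 4096 is rejected because 512 * 8 equals 4096 exactly, which is the case this item proposes to allow.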

(2) bdimx is set as static in the rparams launch parameters, but it is not treated as static when scheduling the fusion. Should use the static bdimx.

(3) If bdimx is static, warp reduction should accept the static bdimx through a template parameter.

(4) For block reduction, the reduction dim is parallelized as Serial, TIDx, Vectorization. However, setting TIDx does not consider the influence of quantization. E.g., at 5120 with Vectorization = 8 and TIDx = 512, Serial = 1.25, which is rounded up to 2, so in the 2nd iteration only 25% of the threads are used. Should optimize to Vectorization = 8, TIDx = 128, Serial = 5 and leave TIDy = 4 for iteration dims when the iteration dim is large enough to ensure there are still 4 * SM-count blocks.
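A rough sketch of the quantization-aware choice described in (4). All names here are hypothetical and this is not the scheduler's actual heuristic. It assumes that the vectorization factor divides the inner dim, that candidate TIDx values are found by halving max_threads down to one warp (32), and that the spare threads go to TIDy only if the grid still has at least 4 * sm_count blocks.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; none of these are nvFuser APIs.
struct InnerSplit {
  int64_t vect;   // Vectorization factor
  int64_t bdimx;  // TIDx extent
  int64_t serial; // Serial loop extent
  int64_t bdimy;  // TIDy extent left for the iteration dim
};

constexpr int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Fraction of thread-iterations that do useful work, in (0, 1].
double utilization(int64_t inner_numel, const InnerSplit& s) {
  return static_cast<double>(inner_numel) /
      static_cast<double>(s.vect * s.bdimx * s.serial);
}

InnerSplit pickSplit(
    int64_t inner_numel,
    int64_t iter_numel,
    int64_t vect,
    int64_t max_threads,
    int64_t sm_count) {
  assert(inner_numel > 0 && vect > 0 && inner_numel % vect == 0);
  assert(max_threads >= 32 && iter_numel > 0 && sm_count > 0);
  const int64_t after_vect = inner_numel / vect;
  // Baseline: all threads on TIDx, Serial rounded up (may waste threads).
  const InnerSplit baseline{
      vect, max_threads, ceilDiv(after_vect, max_threads), 1};
  // Assumption: try TIDx candidates by halving, stopping at one warp.
  for (int64_t bdimx = max_threads; bdimx >= 32; bdimx /= 2) {
    if (after_vect % bdimx != 0) {
      continue; // Serial would be fractional, so threads would sit idle
    }
    const int64_t bdimy = max_threads / bdimx;
    // Only hand threads to TIDy if enough blocks remain to fill the GPU.
    if (bdimy == 1 || ceilDiv(iter_numel, bdimy) >= 4 * sm_count) {
      return {vect, bdimx, after_vect / bdimx, bdimy};
    }
    break;
  }
  return baseline;
}
```

For an inner dim of 5120 with vect = 8, max_threads = 512, iter_numel = 16384, and sm_count = 132, this returns TIDx = 128, Serial = 5, TIDy = 4 with utilization 1.0. The baseline split (TIDx = 512, Serial = 2) has utilization 5120 / 8192 = 0.625.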

(5) Unroll on top of vectorization should also be considered, extending the parallelization from Serial, TIDx, Vectorization to Serial, Unroll, TIDx, Vectorization.
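The extended split in (5) can be pictured as a host-side loop nest. This is only an illustrative sketch with hypothetical names. It assumes the inner dim is split outermost to innermost as [Serial, Unroll, TIDx, Vectorization] and divides evenly. On a GPU the tidx loop would run across threads and the unroll loop would be fully unrolled so each thread issues several independent vectorized loads per serial iteration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch; names are illustrative, not nvFuser APIs.
struct ReductionSplit {
  int64_t serial;
  int64_t unroll;
  int64_t bdimx;
  int64_t vect;
};

// Computes the per-thread partial sums for one row of the inner dim.
// A block or warp reduction would then combine these partials.
std::vector<double> threadPartials(
    const std::vector<double>& in,
    const ReductionSplit& s) {
  assert(
      static_cast<int64_t>(in.size()) ==
      s.serial * s.unroll * s.bdimx * s.vect);
  std::vector<double> partial(s.bdimx, 0.0);
  for (int64_t tidx = 0; tidx < s.bdimx; ++tidx) { // parallel on a GPU
    for (int64_t i = 0; i < s.serial; ++i) {       // Serial
      for (int64_t u = 0; u < s.unroll; ++u) {     // Unroll
        for (int64_t v = 0; v < s.vect; ++v) {     // Vectorization
          const int64_t idx =
              ((i * s.unroll + u) * s.bdimx + tidx) * s.vect + v;
          partial[tidx] += in[idx];
        }
      }
    }
  }
  return partial;
}
```

Setting unroll = 1 recovers the current Serial, TIDx, Vectorization layout, so the two schemes can be compared with the same indexing.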

NVIDIA/Fuser issue #3293