Open liqiangxl opened 5 months ago
Avoiding redundant loads from gmem has no noticeable influence on performance.
An inner reduction with a non-bcast epilogue can be fused into one kernel, which is faster than the segmented version once kernel launch latency is also accounted for. In the fused version, the most common case is that each block loads 1 element of the non-bcast epilogue tensor. In the segmented version, the reduction result is written to gmem, and the 2nd kernel applies the pointwise op to the reduction result and the non-bcast epilogue tensor.
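To make the two variants concrete, here is a minimal host-side C++ sketch that emulates them on the CPU. It is not the code nvFuser generates. The function names (`segmentedSumAdd`, `fusedSumAdd`), the choice of a sum reduction with an add epilogue, and the one-block-per-row mapping are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical emulation of the segmented version: two "kernels" with an
// intermediate buffer standing in for the gmem round trip.
std::vector<float> segmentedSumAdd(const std::vector<float>& x,
                                   const std::vector<float>& y,
                                   std::size_t rows, std::size_t cols) {
  assert(x.size() == rows * cols && y.size() == rows);
  // "Kernel 1": inner reduction, result written to an intermediate buffer.
  std::vector<float> reduced(rows, 0.0f);
  for (std::size_t r = 0; r < rows; ++r) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      acc += x[r * cols + c];
    }
    reduced[r] = acc;
  }
  // "Kernel 2": pointwise op on the reduction result and the non-bcast input.
  std::vector<float> out(rows, 0.0f);
  for (std::size_t r = 0; r < rows; ++r) {
    out[r] = reduced[r] + y[r];
  }
  return out;
}

// Hypothetical emulation of the fused version: each "block" reduces one row
// and loads exactly one element of the non-bcast epilogue tensor.
std::vector<float> fusedSumAdd(const std::vector<float>& x,
                               const std::vector<float>& y,
                               std::size_t rows, std::size_t cols) {
  assert(x.size() == rows * cols && y.size() == rows);
  std::vector<float> out(rows, 0.0f);
  for (std::size_t block = 0; block < rows; ++block) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      acc += x[block * cols + c];
    }
    out[block] = acc + y[block];  // single epilogue load per block
  }
  return out;
}
```

Both functions compute the same result. The fused one skips the intermediate buffer and the second launch, which is where the saving described above would come from.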
The current reduction scheduler limits the types of epilogue pointwise ops that can be fused through SchedulerTopologyChecker. It needs further work in the following areas:
(1) Tests are missing.
(2) The generated code may not be optimal, e.g. outer reduction + non-broadcast pointwise is allowed, but the loading of the additional inputs for the non-broadcast pointwise is not predicated. In the currently generated code, the additional input tensor could instead be loaded inside the final if-condition check.
(3) Some of the limitations may be lifted if the scheduler is revised (this needs to be confirmed).
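The predication point in (2) can be illustrated with a small host-side C++ emulation. This is only a sketch under assumed conditions, not nvFuser's generated kernel. The name `outerSumAdd`, the `threadsY` parameter (threads cooperating along the reduction dimension), and the convention that thread 0 writes the result are all hypothetical. The sketch counts how often the additional input is loaded with and without the predicate.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct EpilogueStats {
  std::vector<float> out;     // reduction result plus non-broadcast input
  std::size_t epilogueLoads;  // number of loads of the additional input
};

// Hypothetical emulation of an outer sum reduction over rows followed by a
// non-broadcast add. threadsY emulated threads share each output column.
EpilogueStats outerSumAdd(const std::vector<float>& x,
                          const std::vector<float>& y,
                          std::size_t rows, std::size_t cols,
                          std::size_t threadsY, bool predicateLoad) {
  assert(threadsY > 0 && x.size() == rows * cols && y.size() == cols);
  std::vector<float> out(cols, 0.0f);
  std::size_t loads = 0;
  for (std::size_t col = 0; col < cols; ++col) {
    std::vector<float> partial(threadsY, 0.0f);
    std::vector<float> yReg(threadsY, 0.0f);
    for (std::size_t ty = 0; ty < threadsY; ++ty) {
      // Each thread accumulates a strided slice of the reduction dimension.
      for (std::size_t r = ty; r < rows; r += threadsY) {
        partial[ty] += x[r * cols + col];
      }
      if (!predicateLoad) {
        yReg[ty] = y[col];  // unpredicated: every thread loads the input
        ++loads;
      }
    }
    // Emulated block reduction across the cooperating threads.
    float total = 0.0f;
    for (std::size_t ty = 0; ty < threadsY; ++ty) {
      total += partial[ty];
    }
    for (std::size_t ty = 0; ty < threadsY; ++ty) {
      if (ty == 0) {  // final write predicate: only one thread stores
        if (predicateLoad) {
          yReg[ty] = y[col];  // load moved inside the final check
          ++loads;
        }
        out[col] = total + yReg[ty];
      }
    }
  }
  return {std::move(out), loads};
}
```

With the load placed under the write predicate, the emulation issues one load per output element instead of one per cooperating thread, while the output stays the same.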