reduction with pointwise epilogue (tests, code optimizaiton, remove limitation)

liqiangxl commented 5 months ago

Current reduction scheduler limites types of epilogue pointwise ops can be fused through SchedulerTopologyChecker. It needs further works in the following areas: (1) the tests are missing (2) the generated code may not optimal, e.g. outer reduciton + non-broadcast pointwise is allowed but the loading of the additional inputs for the non-broadcast pointwise is not predicated. The current generated code is:

  float T5[1];
  T5[0] = 0;
  if (b7) {
    T5[0]
       = T1[i6];
  }
  float T6[1];
  T6[0]
    = T2[0]
    + T5[0];
  if ((((((nvfuser_index_t)blockIdx.y) == (((nvfuser_index_t)gridDim.y) + -1)) && (((nvfuser_index_t)threadIdx.y) == 0)) && b7)) {
    T3[i6]
       = T6[0];
  }

The additional input tensor can be loaded during the final if condition check.


  if ((((((nvfuser_index_t)blockIdx.y) == (((nvfuser_index_t)gridDim.y) + -1)) && (((nvfuser_index_t)threadIdx.y) == 0)) && b7)) {
    float T5[1];
    T5[0] = 0;
    if (b7) {
      T5[0]
         = T1[i6];
    }
    float T6[1];
    T6[0]
      = T2[0]
      + T5[0];  
    T3[i6]
         = T6[0];
  }

(3) some of the limitations may be lifted if the scheduler is revised (needs to confirm).

liqiangxl commented 5 months ago

Avoid redundant load from gmem has no noticable influence on performance.

liqiangxl commented 5 months ago

Inner reduction with non-bcast epilogue can be fused into one kernel and faster than segmented version if the kernel launch latency is also acounted. In fused version: the most common case is, each block load 1 element of the non-bcast epilogue tensor. In segmented version: reduction result is dumped to gmem and the 2nd kernel do pointwise op on the reduction result and non-bcast epilogue tensor.

NVIDIA / Fuser

reduction with pointwise epilogue (tests, code optimizaiton, remove limitation) #2063