NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

check vectorization factor of shared memory consumers to avoid illegal vectorization size #3271

Closed liqiangxl closed 3 weeks ago

liqiangxl commented 4 weeks ago

Issue: The InnerOuter persistent scheduler uses shared memory to store persistent buffers. The data flow is: input in gmem --> async copy to smem --> vectorized load to registers (smem consumers). Each --> is a LoadStoreOp, and both copies use the same vectorization factor. CI found a case where the shared memory persistent buffers have a data type of fp32 while the inputs are fp16 (projection to inputs is not used when there are view ops). The vectorization factor is set to 8, which results in a 32-byte (8 x 4 bytes) vectorized access when loading from shared memory to registers.
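The arithmetic behind the failure can be sketched as follows. This is only an illustration: `bytesPerAccess`, `isLegalVectorization`, and `kMaxVectBytes` are hypothetical names, not nvFuser identifiers, and the 16-byte bound is the one stated in this PR description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of nvFuser): bytes moved by one vectorized
// access with the given vectorization factor and element size.
constexpr int64_t bytesPerAccess(int64_t vect_factor, int64_t dtype_size_bytes) {
  return vect_factor * dtype_size_bytes;
}

// Upper bound on a single vectorized access, as stated in this PR.
constexpr int64_t kMaxVectBytes = 16;

constexpr bool isLegalVectorization(
    int64_t vect_factor,
    int64_t dtype_size_bytes) {
  return bytesPerAccess(vect_factor, dtype_size_bytes) <= kMaxVectBytes;
}

// Factor 8 chosen for fp16 (2-byte) inputs: 8 * 2 = 16 bytes, within the bound.
static_assert(isLegalVectorization(8, 2), "fp16 with factor 8 fits in 16 bytes");
// The same factor applied to fp32 (4-byte) smem buffers: 8 * 4 = 32 bytes.
static_assert(!isLegalVectorization(8, 4), "fp32 with factor 8 exceeds 16 bytes");
```

Reusing a factor derived from the input data type is only safe when the buffer being loaded has an element size no larger than that of the inputs.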

Changes: (1) Added code to handle the vectorization of smem consumers: an additional split is added if the smem --> regs copy would lead to a vectorization larger than 16 bytes. (2) Added a test.
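A minimal sketch of how the factor for such an additional split could be derived is shown below. It is not the code added in this PR: `SplitPlan` and `planSmemConsumerSplit` are hypothetical names, and the sketch assumes the original factor is evenly divisible by the largest legal factor.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type (not an nvFuser type): how an axis of extent
// `vect_factor` would be divided by the additional split.
struct SplitPlan {
  int64_t outer_extent; // iterations left outside the vectorized axis
  int64_t inner_extent; // extent of the axis that remains vectorized
};

// Assumption: a vectorized access must not exceed 16 bytes, as stated in the PR.
inline SplitPlan planSmemConsumerSplit(
    int64_t vect_factor,
    int64_t dtype_size_bytes) {
  constexpr int64_t kMaxVectBytes = 16;
  assert(vect_factor > 0);
  assert(dtype_size_bytes > 0 && dtype_size_bytes <= kMaxVectBytes);

  // Largest factor that keeps one access within the byte bound.
  const int64_t max_factor = kMaxVectBytes / dtype_size_bytes;
  if (vect_factor <= max_factor) {
    // Already legal: no additional split is needed.
    return {1, vect_factor};
  }
  // Assumption of this sketch: the factor divides evenly.
  assert(vect_factor % max_factor == 0);
  return {vect_factor / max_factor, max_factor};
}
```

For the failing case (factor 8, fp32), this yields an outer extent of 2 and an inner vectorized extent of 4, so each access is 16 bytes. How the outer part is parallelized or unrolled is left open here.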

Results: All vectorized accesses are now <= 16 bytes.

Follow-up work: see issue https://github.com/NVIDIA/Fuser/issues/3272

liqiangxl commented 4 weeks ago

!build

liqiangxl commented 4 weeks ago

!build

liqiangxl commented 3 weeks ago

!build

liqiangxl commented 3 weeks ago

!build

liqiangxl commented 3 weeks ago

!build

liqiangxl commented 3 weeks ago

Revised to ensure the correct axis is used.

    // Non-concretized broadcast domains are moved to the innermost position
    // before transform propagation, so skip these axes when looking for the
    // vectorized axis.
    int64_t vect_axis_pos = -1;
    while (tv->axis(vect_axis_pos)->isBroadcast()) {
      vect_axis_pos--;
      // Guard the next tv->axis() call against walking past the outermost axis.
      NVF_ERROR(
          vect_axis_pos + tv->nDims() >= 0,
          "Out of bound access when visiting dim ",
          vect_axis_pos,
          " in Tv: ",
          tv->toString());
    }
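
The axis search above can be modeled outside nvFuser as a standalone sketch. The function name `innermostNonBroadcastPos` and the `std::vector<bool>` representation of a tensor's axes are assumptions made for illustration; unlike the snippet above, this version checks the bound before every access, so it also covers a tensor with no axes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical model: is_broadcast[i] is true if axis i is a broadcast axis.
// Returns the negative position (-1 is the innermost axis) of the innermost
// non-broadcast axis, or throws if every axis is a broadcast axis.
inline int64_t innermostNonBroadcastPos(const std::vector<bool>& is_broadcast) {
  const int64_t n_dims = static_cast<int64_t>(is_broadcast.size());
  int64_t pos = -1;
  while (true) {
    // Bounds check before each access, mirroring the NVF_ERROR guard.
    if (pos + n_dims < 0) {
      throw std::out_of_range("no non-broadcast axis found");
    }
    if (!is_broadcast[static_cast<std::size_t>(pos + n_dims)]) {
      return pos;
    }
    --pos;
  }
}
```

For axes `{false, true, true}` the function returns -3, since the two innermost axes are broadcast and are skipped.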

liqiangxl commented 3 weeks ago

!build