Closed liqiangxl closed 3 weeks ago
!build
Revised to ensure the correct axis is used.
// Non-concretized broadcast domains are moved to the innermost position
// before transform propagation, so skip these axes.
int64_t vect_axis_pos = -1;
while (tv->axis(vect_axis_pos)->isBroadcast()) {
  vect_axis_pos--;
  NVF_ERROR(
      vect_axis_pos + tv->nDims() >= 0,
      "Out of bound access when visiting dim ",
      vect_axis_pos,
      " in Tv: ",
      tv->toString());
}
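For readers without the nvFuser sources at hand, the same scan can be sketched on a plain vector of flags. This is only an illustration: `innermostNonBroadcastPos` and the `is_broadcast` vector are hypothetical stand-ins for `tv->axis(i)->isBroadcast()`, not nvFuser API, and this version checks the bound before each access rather than after the decrement.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a tensor's loop domain: is_broadcast[i] is true
// when axis i is a broadcast axis. Returns the negative position (-1 is the
// innermost axis) of the innermost non-broadcast axis.
int64_t innermostNonBroadcastPos(const std::vector<bool>& is_broadcast) {
  const int64_t ndims = static_cast<int64_t>(is_broadcast.size());
  int64_t pos = -1;
  while (true) {
    // Bound check first, so an all-broadcast (or empty) domain never
    // indexes past the front of the vector.
    if (pos + ndims < 0) {
      throw std::out_of_range("no non-broadcast axis found");
    }
    if (!is_broadcast[pos + ndims]) {
      return pos;
    }
    --pos;
  }
}
```

With two trailing broadcast axes, for example `{false, false, true, true}`, the scan steps past positions -1 and -2 and returns -3.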
Issue: The InnerOuter persistent scheduler uses shared memory to store persistent buffers. The data flow is:
input in gmem --> async copy to smem --> vectorized load to registers (smem consumers)
Each --> is simply a LoadStoreOp, and both copies use the same vectorization factor. CI found a case where the shared memory persistent buffers have a data type of fp32 while the inputs are fp16 (when there are view ops, projection to inputs is not used). The vectorization factor is set to 8, which caused a 32-byte vectorization when loading from shared memory to registers.
Changes: (1) Added code to handle the vectorization of smem consumers: an additional split is added if the smem --> regs copy would lead to a vectorization larger than 16 bytes. (2) Added a test.
Results: Ensure vectorizations are <= 16 bytes.
Follow-up work: see issue https://github.com/NVIDIA/Fuser/issues/3272