I would suggest preparing an implementation of the flash attention algorithm (I prefer to call it a parallel algorithm).
I think flash attention has many implications for how we schedule an efficient DNN computation, since it combines several elements: reusing the TCU's output at the register level, warp reduction, element-wise operations, the arrangement of warps, and so on.
Preparing the implementation first will let us observe how to organize the structure of the computation.
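
To make the discussion a bit more concrete, here is a minimal CUDA sketch of the warp-reduction and element-wise pieces that the online softmax in flash attention builds on. This is just an illustration under my own assumptions, not the kernel I have in mind for the project: the name `softmax_rows` and the one-warp-per-row launch layout are hypothetical, and the full algorithm would keep running max/sum statistics per query tile and rescale the partial output held in registers as new key/value tiles arrive.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Warp-wide max reduction using shuffle intrinsics.
__device__ float warp_reduce_max(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, offset));
    return v;
}

// Warp-wide sum reduction using shuffle intrinsics.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}

// Hypothetical example: one warp normalizes one row of attention scores.
// Each lane strides over the row, so `cols` need not be a multiple of 32.
__global__ void softmax_rows(const float* scores, float* probs, int cols) {
    int row  = blockIdx.x;        // one block (a single warp) per row
    int lane = threadIdx.x;       // 32 threads per block

    const float* in  = scores + (size_t)row * cols;
    float*       out = probs  + (size_t)row * cols;

    // 1. Row max (for numerical stability), reduced across the warp.
    float m = -INFINITY;
    for (int c = lane; c < cols; c += 32) m = fmaxf(m, in[c]);
    m = warp_reduce_max(m);

    // 2. Row sum of exp(x - max), reduced across the warp.
    float s = 0.f;
    for (int c = lane; c < cols; c += 32) s += expf(in[c] - m);
    s = warp_reduce_sum(s);

    // 3. Element-wise normalization.
    for (int c = lane; c < cols; c += 32) out[c] = expf(in[c] - m) / s;
}
```

In the flash attention setting, `m` and `s` would not be computed once over a whole row but maintained incrementally per tile, which is exactly the kind of scheduling structure I think is worth prototyping first.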