I would suggest preparing an implementation of the flash attention algorithm (I prefer to call it a parallel algorithm).
I think flash attention has many implications for how we schedule an efficient DNN computation, since it combines several elements: reusing the TCU's output at the register level, warp reduction, element-wise operations, the arrangement of warps, and so on.
Preparing the implementation first will let us observe how to organize the structure of the computation.
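
To make the discussion a bit more concrete, here is a minimal CUDA sketch of the warp-reduction and element-wise pieces that the online softmax in flash attention builds on. This is just an illustration under my own assumptions, not the kernel I have in mind for the project: the name `softmax_rows` and the one-warp-per-row launch layout are hypothetical, and the full algorithm would keep running max/sum statistics per query tile and rescale the partial output held in registers as new key/value tiles arrive.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Warp-wide max reduction using shuffle intrinsics.
__device__ float warp_reduce_max(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, offset));
    return v;
}

// Warp-wide sum reduction using shuffle intrinsics.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}

// Hypothetical example: one warp normalizes one row of attention scores.
// Each lane strides over the row, so `cols` need not be a multiple of 32.
__global__ void softmax_rows(const float* scores, float* probs, int cols) {
    int row  = blockIdx.x;        // one block (a single warp) per row
    int lane = threadIdx.x;       // 32 threads per block

    const float* in  = scores + (size_t)row * cols;
    float*       out = probs  + (size_t)row * cols;

    // 1. Row max (for numerical stability), reduced across the warp.
    float m = -INFINITY;
    for (int c = lane; c < cols; c += 32) m = fmaxf(m, in[c]);
    m = warp_reduce_max(m);

    // 2. Row sum of exp(x - max), reduced across the warp.
    float s = 0.f;
    for (int c = lane; c < cols; c += 32) s += expf(in[c] - m);
    s = warp_reduce_sum(s);

    // 3. Element-wise normalization.
    for (int c = lane; c < cols; c += 32) out[c] = expf(in[c] - m) / s;
}
```

In the flash attention setting, `m` and `s` would not be computed once over a whole row but maintained incrementally per tile, which is exactly the kind of scheduling structure I think is worth prototyping first.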