NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Available Fusion Options in EVT #1595

Open satyabhagavan opened 2 weeks ago

satyabhagavan commented 2 weeks ago

What is your question? The provided examples show that EVT (Epilogue Visitor Tree) can fuse different epilogue functions into a GEMM. Can EVT also fuse preprocessing steps that are applied to an input matrix before the GEMM? Specifically, I would like to apply some preprocessing to an input matrix and have that step fused into the GEMM itself, rather than run as a separate kernel. Does EVT support this kind of preprocessing-plus-GEMM fusion?

thakkarV commented 2 weeks ago

What you are referring to is mainloop fusion. It is available in a limited capacity in pre-Hopper kernels via the 2.x API. Unlike EVT, however, the perf cliffs of mainloop fusion are more intricate, and optimizing performance requires a lot of domain-specific knowledge of exactly what is being fused, so we have not implemented an "MVT" yet. It is not impossible to do, so if you have a compelling set of use cases, please let us know.

In the meantime, the strategy I can recommend is to always move the preprocessing of the GEMM input into the epilogue of the prior kernel.
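To make the recommended strategy concrete, here is a minimal CPU-side sketch in plain C++. It does not use any CUTLASS API. It assumes a hypothetical chain of two GEMMs in which the preprocessing for the second GEMM is `max(0, x * scale + bias)`; the names `preprocess`, `gemm`, and `epilogue_fusion_matches_reference` are made up for illustration. The sketch only checks that applying the preprocessing in the producer's epilogue yields the same result as running it as a separate pass between the two GEMMs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage

// Hypothetical preprocessing step: y = max(0, x * scale + bias).
inline float preprocess(float x, float scale, float bias) {
    return std::max(0.0f, x * scale + bias);
}

// Reference GEMM: D = A (M x K) * B (K x N).
// `epilogue` is applied to each accumulator just before it is stored.
template <typename Epilogue>
Matrix gemm(const Matrix& A, const Matrix& B, int M, int N, int K, Epilogue epilogue) {
    Matrix D(static_cast<std::size_t>(M) * N);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            D[m * N + n] = epilogue(acc);
        }
    }
    return D;
}

// Small integer-valued test data so that all arithmetic is exact in float.
Matrix make_matrix(int rows, int cols, int mod, int shift) {
    Matrix m(static_cast<std::size_t>(rows) * cols);
    for (std::size_t i = 0; i < m.size(); ++i) {
        m[i] = static_cast<float>(static_cast<int>(i % mod) - shift);
    }
    return m;
}

bool epilogue_fusion_matches_reference() {
    const int M = 4, N = 3, K = 5, P = 2;
    const float scale = 0.5f, bias = -1.0f;
    const Matrix X = make_matrix(M, K, 5, 2);
    const Matrix W = make_matrix(K, N, 3, 1);
    const Matrix V = make_matrix(N, P, 4, 1);

    auto identity = [](float v) { return v; };
    auto fused = [=](float v) { return preprocess(v, scale, bias); };

    // Unfused: GEMM 1, a separate preprocessing pass, then GEMM 2.
    Matrix T = gemm(X, W, M, N, K, identity);
    for (float& v : T) v = preprocess(v, scale, bias);
    const Matrix reference = gemm(T, V, M, P, N, identity);

    // Fused: the preprocessing runs in the epilogue of GEMM 1,
    // so GEMM 2 reads input that is already preprocessed.
    const Matrix T_fused = gemm(X, W, M, N, K, fused);
    const Matrix result = gemm(T_fused, V, M, P, N, identity);

    return reference == result;
}
```

Calling `epilogue_fusion_matches_reference()` from a `main` function should return `true`. In this model the fused chain drops one full read and write of the intermediate matrix; how that maps onto a real GPU pipeline depends on the kernels involved.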

satyabhagavan commented 2 weeks ago

Thank you for your reply. For mainloop fusion, I referred to example 26_ampere_wgrad_mainloop_fusion. I compared the fused kernel from this example against a decomposed version: a simple custom kernel that applies scale, bias, and ReLU to all elements, followed by a wgrad kernel. The fused kernel takes slightly longer to execute than the separate preprocessing and wgrad kernels. The workload is 512x64x64x16, and I am running on an A10 GPU. Am I missing something?
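The two variants being compared can be modeled on the CPU as follows. This is a rough sketch in plain C++, not the code of example 26 and not a diagnosis of the timing difference reported above. It assumes a plain GEMM in place of wgrad, a single scalar `scale` and `bias`, and hypothetical function names. It checks that both variants produce the same output and counts how many times the preprocessing is evaluated in each.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical elementwise preprocessing: scale, bias, then ReLU.
// `calls` counts how many times the transform is evaluated.
inline float scale_bias_relu(float x, float scale, float bias, std::size_t& calls) {
    ++calls;
    return std::max(0.0f, x * scale + bias);
}

struct Result {
    std::vector<float> D;          // row-major M x N output
    std::size_t preprocess_calls;  // number of transform evaluations
};

// Decomposed: one preprocessing pass over A, then a plain GEMM.
Result decomposed(std::vector<float> A, const std::vector<float>& B,
                  int M, int N, int K, float scale, float bias) {
    std::size_t calls = 0;
    for (float& a : A) a = scale_bias_relu(a, scale, bias, calls);
    std::vector<float> D(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                D[m * N + n] += A[m * K + k] * B[k * N + n];
    return Result{D, calls};
}

// Naive "fused on load": the transform is applied every time an element
// of A is read inside the k loop, with no separate pass over A.
Result fused_on_load(const std::vector<float>& A, const std::vector<float>& B,
                     int M, int N, int K, float scale, float bias) {
    std::size_t calls = 0;
    std::vector<float> D(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                D[m * N + n] +=
                    scale_bias_relu(A[m * K + k], scale, bias, calls) * B[k * N + n];
    return Result{D, calls};
}

// Small integer-valued test data so that all arithmetic is exact in float.
std::vector<float> make_matrix(int rows, int cols, int mod, int shift) {
    std::vector<float> m(static_cast<std::size_t>(rows) * cols);
    for (std::size_t i = 0; i < m.size(); ++i) {
        m[i] = static_cast<float>(static_cast<int>(i % mod) - shift);
    }
    return m;
}
```

With `M = 4`, `N = 3`, `K = 5`, the decomposed variant evaluates the transform `M * K` times and the naive fused variant evaluates it `M * N * K` times, while both produce identical outputs. A tiled GPU kernel reuses loaded operands within a tile, so its redundancy is lower than in this naive model, but each output tile that reads the same operand slice still has to apply the transform itself. Whether this contributes to the gap measured on the A10 would need to be confirmed by profiling.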