Open satyabhagavan opened 2 weeks ago
What you are referring to is mainloop fusion. This is available in a limited capacity in pre-Hopper kernels via the 2.x API, but unlike EVT, the perf cliffs of mainloop fusion are more intricate, and optimizing them requires a lot of domain-specific knowledge of exactly what is being fused, so we have not implemented an "MVT" yet. It is not impossible to do, so if you have a compelling set of use cases, please let us know.
In the meantime, the strategy I recommend is to always move the preprocessing of the GEMM input into the epilogue of the prior kernel.
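To make that recommendation concrete, here is a minimal CPU reference sketch in plain C++, not CUTLASS code. It assumes the preprocessing is a per-column scale, bias, and ReLU, and that the kernel producing the GEMM input is itself a GEMM. The names `Matrix`, `gemm`, `producer_unfused`, and `producer_fused_epilogue` are hypothetical and exist only for this illustration. The point it shows is that applying the transform when the producer stores its accumulators gives the same tensor as a separate elementwise pass, without the extra read and write of that tensor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix type, used only for this illustration.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// The preprocessing discussed in the thread: relu(scale * x + bias).
inline float scale_bias_relu(float x, float scale, float bias) {
    return std::max(0.0f, scale * x + bias);
}

// Reference GEMM whose epilogue is a callable applied to each accumulator
// just before it is stored: D(i, j) = epilogue(sum_p A(i, p) * B(p, j), j).
template <typename Epilogue>
Matrix gemm(const Matrix& a, const Matrix& b, Epilogue epilogue) {
    assert(a.cols == b.rows);
    Matrix d(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t j = 0; j < b.cols; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < a.cols; ++p) {
                acc += a.at(i, p) * b.at(p, j);
            }
            d.at(i, j) = epilogue(acc, j);
        }
    }
    return d;
}

// Unfused producer: plain GEMM, then a separate elementwise pass that
// re-reads and re-writes the whole output.
Matrix producer_unfused(const Matrix& a, const Matrix& b,
                        const std::vector<float>& scale,
                        const std::vector<float>& bias) {
    assert(scale.size() == b.cols && bias.size() == b.cols);
    Matrix d = gemm(a, b, [](float acc, std::size_t) { return acc; });
    for (std::size_t i = 0; i < d.rows; ++i) {
        for (std::size_t j = 0; j < d.cols; ++j) {
            d.at(i, j) = scale_bias_relu(d.at(i, j), scale[j], bias[j]);
        }
    }
    return d;
}

// Producer with the preprocessing folded into its epilogue: the transform
// runs while the accumulator is still in hand, so no second pass is needed.
Matrix producer_fused_epilogue(const Matrix& a, const Matrix& b,
                               const std::vector<float>& scale,
                               const std::vector<float>& bias) {
    assert(scale.size() == b.cols && bias.size() == b.cols);
    return gemm(a, b, [&](float acc, std::size_t col) {
        return scale_bias_relu(acc, scale[col], bias[col]);
    });
}
```

The output of `producer_fused_epilogue` can then be fed directly to the next GEMM as its already-preprocessed input. On the GPU, the epilogue lambda would correspond to an epilogue functor or an EVT node on the producing kernel. That mapping is stated here as an assumption, not as a description of a specific CUTLASS API.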
Thank you for your reply. For mainloop fusion, I referred to example 26_ampere_wgrad_mainloop_fusion. I compared the fused kernel from this example against a decomposed version consisting of two kernels: a custom preprocessing kernel (a simple kernel that applies scale, bias, and ReLU to all elements) and a wgrad kernel. I observed that the execution time of the fused kernel is slightly higher, meaning the fusion takes more time than the separate preprocessing and wgrad kernels combined. The workload is 512x64x64x16, and I am running on an A10 GPU. Am I missing something?
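The two variants being compared can be sketched as a CPU reference. This is a simplified illustration in plain C++, not the code from example 26. It uses a plain GEMM in place of wgrad, assumes scale and bias are indexed along the K dimension of the A operand, and all names (`RunResult`, `run_decomposed`, `run_mainloop_fused`) are hypothetical. Both paths produce the same result, but the fused path evaluates the preprocessing every time an operand element is loaded in the mainloop, while the decomposed path evaluates it once per element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Output of a reference run, plus how often the preprocessing was evaluated.
struct RunResult {
    std::vector<float> d;          // row-major M x N output
    std::size_t preprocess_calls;  // number of scale/bias/relu evaluations
};

inline float preprocess_element(float x, float scale, float bias) {
    return std::max(0.0f, scale * x + bias);
}

// Decomposed path: one elementwise pass over A (M x K), then a plain GEMM.
RunResult run_decomposed(const std::vector<float>& a, const std::vector<float>& b,
                         const std::vector<float>& scale, const std::vector<float>& bias,
                         std::size_t m, std::size_t n, std::size_t k) {
    assert(a.size() == m * k && b.size() == k * n);
    assert(scale.size() == k && bias.size() == k);
    RunResult out{std::vector<float>(m * n, 0.0f), 0};
    std::vector<float> a_pre(m * k);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            a_pre[i * k + p] = preprocess_element(a[i * k + p], scale[p], bias[p]);
            ++out.preprocess_calls;
        }
    }
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a_pre[i * k + p] * b[p * n + j];
            }
            out.d[i * n + j] = acc;
        }
    }
    return out;
}

// Mainloop-fused path: A is preprocessed as it is loaded inside the K loop,
// so no intermediate tensor is written, but the transform is re-evaluated
// for every output element that consumes the same A element.
RunResult run_mainloop_fused(const std::vector<float>& a, const std::vector<float>& b,
                             const std::vector<float>& scale, const std::vector<float>& bias,
                             std::size_t m, std::size_t n, std::size_t k) {
    assert(a.size() == m * k && b.size() == k * n);
    assert(scale.size() == k && bias.size() == k);
    RunResult out{std::vector<float>(m * n, 0.0f), 0};
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                float a_val = preprocess_element(a[i * k + p], scale[p], bias[p]);
                ++out.preprocess_calls;
                acc += a_val * b[p * n + j];
            }
            out.d[i * n + j] = acc;
        }
    }
    return out;
}
```

In this naive loop nest the fused path performs M*N*K preprocessing evaluations against M*K for the decomposed path. A tiled GPU kernel reuses loaded operands across a tile, so the real redundancy is smaller, and these counts should be read as an illustration of the trade-off (extra math in the mainloop versus an extra pass over memory), not as an explanation of the measured timings on the A10.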
What is your question?
In the examples provided, EVT demonstrates the capability to fuse different epilogue functions, optimizing their execution. I'm interested in knowing whether EVT can also integrate preprocessing steps applied to a matrix before the GEMM operation. Specifically, can EVT handle a scenario where some preprocessing of the input matrix is fused into the GEMM operation itself? I want to understand whether EVT supports such a fusion of preprocessing and GEMM operations.