The current implementation supports two operations on the SM80 (NVIDIA Ampere) architecture:
- AllGather followed by GEMM (General Matrix-Matrix Multiplication)
- GEMM followed by Reduce-Scatter
The fused operations deliver better performance than invoking the GEMM and the communication operation separately. This optimization is crucial for high-performance computing tasks, especially LLM training and inference.
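To make the fusion concrete, below is a minimal sketch of the *unfused* baseline that these operations replace, written with standard `torch.distributed` collectives. The function names, tensor shapes, and process-group handling are illustrative assumptions, not this repository's API.

```python
# Unfused baseline: collective and GEMM run back-to-back.
# (Illustrative sketch only; shapes and group setup are assumptions.)
import torch
import torch.distributed as dist

def allgather_then_gemm(x_local, weight, group=None):
    """AllGather the row-sharded activations, then run one large GEMM."""
    world_size = dist.get_world_size(group)
    x_full = torch.empty(
        x_local.shape[0] * world_size, x_local.shape[1],
        dtype=x_local.dtype, device=x_local.device,
    )
    dist.all_gather_into_tensor(x_full, x_local, group=group)  # communication
    return x_full @ weight                                      # then compute

def gemm_then_reduce_scatter(x, weight_shard, group=None):
    """Compute a partial GEMM on each rank, then Reduce-Scatter the results."""
    world_size = dist.get_world_size(group)
    partial = x @ weight_shard                                  # compute
    out = torch.empty(
        partial.shape[0] // world_size, partial.shape[1],
        dtype=partial.dtype, device=partial.device,
    )
    dist.reduce_scatter_tensor(out, partial, group=group)       # then communication
    return out
```

In this baseline the GPU alternates between communication and computation, leaving one idle while the other runs; the fused kernels instead overlap the GEMM tiles with the collective transfers, which is where the performance gain comes from.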
We sincerely appreciate all contributors, including but not limited to @kongroo, @wenlei-bao, @houqi, @Meteorix, @liwenchangbdbz, @ZihengJiang, and @eric-haibin-lin.