The current implementation supports two operations on the SM80 (NVIDIA Ampere) architecture:
- AllGather followed by GEMM (General Matrix-Matrix Multiplication)
- GEMM followed by Reduce-Scatter
The fused operations deliver better performance than invoking the GEMM and the communication operation separately. This optimization is crucial for high-performance computing tasks, especially LLM training and inference.
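To make the fusion concrete, below is a minimal sketch of the *unfused* baseline that these operations replace, written with standard `torch.distributed` collectives. The function names, tensor shapes, and process-group handling are illustrative assumptions, not this repository's API.

```python
# Unfused baseline: collective and GEMM run back-to-back.
# (Illustrative sketch only; shapes and group setup are assumptions.)
import torch
import torch.distributed as dist

def allgather_then_gemm(x_local, weight, group=None):
    """AllGather the row-sharded activations, then run one large GEMM."""
    world_size = dist.get_world_size(group)
    x_full = torch.empty(
        x_local.shape[0] * world_size, x_local.shape[1],
        dtype=x_local.dtype, device=x_local.device,
    )
    dist.all_gather_into_tensor(x_full, x_local, group=group)  # communication
    return x_full @ weight                                      # then compute

def gemm_then_reduce_scatter(x, weight_shard, group=None):
    """Compute a partial GEMM on each rank, then Reduce-Scatter the results."""
    world_size = dist.get_world_size(group)
    partial = x @ weight_shard                                  # compute
    out = torch.empty(
        partial.shape[0] // world_size, partial.shape[1],
        dtype=partial.dtype, device=partial.device,
    )
    dist.reduce_scatter_tensor(out, partial, group=group)       # then communication
    return out
```

In this baseline the GPU alternates between communication and computation, leaving one idle while the other runs; the fused kernels instead overlap the GEMM tiles with the collective transfers, which is where the performance gain comes from.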
We sincerely appreciate all contributors, including but not limited to @kongroo, @wenlei-bao, @houqi, @Meteorix, @liwenchangbdbz, @ZihengJiang, and @eric-haibin-lin.