bytedance / flux

A fast communication-overlapping library for tensor parallelism on GPUs.
Apache License 2.0

All gather and reduce scatter on SM80 #3

Closed · zheng-ningxin closed this issue 5 months ago

zheng-ningxin commented 5 months ago

The current implementation supports two operations on the SM80 (Ampere) architecture (their unfused semantics are sketched below):

  1. AllGather followed by GEMM (General Matrix-Matrix Multiplication)
  2. GEMM followed by ReduceScatter
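
For reference, here is a minimal PyTorch sketch of the unfused semantics these two fused kernels replace. The shard layout, shapes, and function names are illustrative assumptions for a row-sharded AllGather+GEMM and a K-sharded GEMM+ReduceScatter; this is not the flux API.

```python
import torch
import torch.distributed as dist

def allgather_then_gemm(x_local, weight, tp_group):
    # AllGather + GEMM: gather row-sharded activations [M/tp, K] from
    # every rank, then multiply the full [M, K] by weight [K, N].
    tp = dist.get_world_size(tp_group)
    x_full = torch.empty(x_local.shape[0] * tp, x_local.shape[1],
                         dtype=x_local.dtype, device=x_local.device)
    dist.all_gather_into_tensor(x_full, x_local, group=tp_group)
    return x_full @ weight                       # [M, N]

def gemm_then_reduce_scatter(x, weight_shard, tp_group):
    # GEMM + ReduceScatter: each rank multiplies by its K-shard of the
    # weight, producing a partial [M, N]; the partials are summed and
    # scattered so each rank keeps an [M/tp, N] slice.
    partial = x @ weight_shard                   # partial sum over K
    tp = dist.get_world_size(tp_group)
    out = torch.empty(partial.shape[0] // tp, partial.shape[1],
                      dtype=partial.dtype, device=partial.device)
    dist.reduce_scatter_tensor(out, partial, group=tp_group)
    return out                                   # [M/tp, N]
```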

The fused operations outperform invoking GEMM and the communication collectives separately, because the communication is overlapped with computation instead of being serialized after it. This optimization is especially valuable for LLM training and inference, where tensor-parallel layers alternate GEMMs with collectives.
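
The actual overlap happens inside the fused CUDA kernels; as a conceptual illustration only, here is a hedged host-side sketch of the idea for the AllGather+GEMM case. It assumes the tensor-parallel group spans all ranks (so group rank equals the global `src` rank) and decomposes the all-gather into per-rank broadcasts, letting the GEMM on chunk `r` run while later chunks are still in flight.

```python
import torch
import torch.distributed as dist

def overlapped_allgather_gemm(x_local, weight, tp_group=None):
    # Coarse analogue of the overlap: start the GEMM on each chunk as
    # soon as that chunk arrives, rather than after the full gather.
    tp = dist.get_world_size(tp_group)
    rank = dist.get_rank(tp_group)
    chunks = [x_local if r == rank else torch.empty_like(x_local)
              for r in range(tp)]
    # Issue all broadcasts up front; NCCL runs them on its own stream.
    works = [dist.broadcast(chunks[r], src=r, group=tp_group, async_op=True)
             for r in range(tp)]
    outs = []
    for r in range(tp):
        works[r].wait()                   # wait only for chunk r
        outs.append(chunks[r] @ weight)   # overlaps with in-flight chunks
    return torch.cat(outs, dim=0)
```

The real kernels fuse the communication at a much finer granularity inside the GEMM itself, avoiding the kernel-launch and host-side synchronization overheads of this coarse version.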

We sincerely appreciate all contributors, including but not limited to @kongroo, @wenlei-bao, @houqi, @Meteorix, @liwenchangbdbz, @ZihengJiang, and @eric-haibin-lin.