Closed — yzh119 closed this PR 1 month ago
This PR implements the attention all-reduce kernel which will be used in merging attention states from different GPUs in sequence parallelism.
We use mscclpp for collective communication; thanks to @liangyurain for teaching me how to use mscclpp.
Co-authored-by: Liangyu Zhao liangyu@cs.washington.edu
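For readers unfamiliar with the operation: merging attention states across GPUs is the standard log-sum-exp (LSE) weighted combine — each rank computes attention over its local slice of the KV sequence and produces a partial output plus its LSE, and the all-reduce combines them into the exact full-sequence result. The sketch below is an illustrative NumPy reference of that math, not the CUDA/mscclpp kernel in this PR; the function name and shapes are hypothetical.

```python
import numpy as np

def merge_attention_states(outputs, lses):
    """Merge per-rank partial attention outputs using their log-sum-exp values.

    outputs: list of [num_heads, head_dim] arrays, one partial output per rank
    lses:    list of [num_heads] arrays, the log-sum-exp of scores per rank

    Illustrative reference only; names/shapes are assumptions, not this PR's API.
    """
    lses = np.stack(lses)          # [world_size, num_heads]
    outputs = np.stack(outputs)    # [world_size, num_heads, head_dim]
    lse_max = lses.max(axis=0)     # subtract max for numerical stability
    weights = np.exp(lses - lse_max)                 # unnormalized weights
    weights /= weights.sum(axis=0, keepdims=True)    # softmax over ranks
    merged = (weights[..., None] * outputs).sum(axis=0)
    merged_lse = lse_max + np.log(np.exp(lses - lse_max).sum(axis=0))
    return merged, merged_lse
```

Because the weights are exactly `exp(lse_i) / sum_j exp(lse_j)`, merging the partial states over KV chunks reproduces attention over the full sequence bit-for-bit (up to float rounding), which is what makes the sequence-parallel split correct.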