cat-state opened 2 weeks ago
Ah yeah, I see that (NCCL docs: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html#c.ncclReduce).
I think in this case, because of Rust's borrow rules, it'd probably be easiest to just add `Comm::all_reduce_in_place`, which takes a `&mut CudaSlice`. It's a fairly easy addition if anyone wants to contribute a PR for this! Otherwise I can add it later this week.
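A minimal sketch of what the proposed `Comm::all_reduce_in_place` could look like, assuming NCCL's documented convention that a collective runs in place when `sendbuff == recvbuff`. `DeviceSlice`, `Comm`, and `raw_all_reduce_sum` are hypothetical host-side stand-ins, not cudarc's real types or bindings, and the "reduction" simply adds a fixed vector standing in for the other ranks' contributions. The point is the signature: one `&mut` borrow is turned into a single pointer that is passed as both the send and the receive buffer.

```rust
// Hypothetical stand-in for a device buffer (host memory here, NOT cudarc's CudaSlice).
struct DeviceSlice {
    data: Vec<f32>,
}

// Hypothetical communicator; `peer` stands in for the summed data of the other ranks.
struct Comm {
    peer: Vec<f32>,
}

/// Hypothetical stand-in for a raw ncclAllReduce(sendbuff, recvbuff, count, ncclSum, ...).
/// Like NCCL, it accepts send == recv: each element is read before it is written.
///
/// # Safety
/// `send` must be valid for `count` reads and `recv` for `count` writes;
/// they may be the same pointer but must not partially overlap.
unsafe fn raw_all_reduce_sum(send: *const f32, recv: *mut f32, count: usize, peer: &[f32]) {
    unsafe {
        for i in 0..count {
            let v = *send.add(i);
            *recv.add(i) = v + peer[i];
        }
    }
}

impl Comm {
    /// Out-of-place: the borrow checker guarantees `send` and `recv` are distinct buffers.
    fn all_reduce(&self, send: &DeviceSlice, recv: &mut DeviceSlice) {
        let n = send.data.len();
        assert!(recv.data.len() == n && self.peer.len() == n);
        unsafe { raw_all_reduce_sum(send.data.as_ptr(), recv.data.as_mut_ptr(), n, &self.peer) }
    }

    /// In-place: a single `&mut` borrow, passed as both the send and the recv pointer.
    fn all_reduce_in_place(&self, buf: &mut DeviceSlice) {
        let n = buf.data.len();
        assert_eq!(self.peer.len(), n);
        let ptr = buf.data.as_mut_ptr();
        unsafe { raw_all_reduce_sum(ptr as *const f32, ptr, n, &self.peer) }
    }
}

fn main() {
    let comm = Comm { peer: vec![1.0, 2.0, 3.0] };

    // Out-of-place still works as before.
    let src = DeviceSlice { data: vec![1.0, 1.0, 1.0] };
    let mut dst = DeviceSlice { data: vec![0.0; 3] };
    comm.all_reduce(&src, &mut dst);
    println!("{:?}", dst.data); // [2.0, 3.0, 4.0]

    // In-place: no second buffer needed.
    let mut buf = DeviceSlice { data: vec![10.0, 20.0, 30.0] };
    comm.all_reduce_in_place(&mut buf);
    println!("{:?}", buf.data); // [11.0, 22.0, 33.0]
}
```

Keeping the in-place path as a separate method means the safe out-of-place signature doesn't have to change; only the new method needs to reason about the aliased pointer.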
NCCL supports in-place all-reduce. However, `Comm::all_reduce` takes a `&CudaSlice` to read from and a `&mut CudaSlice` to write into, which doesn't allow in-place reduction: Rust won't let the same buffer be borrowed immutably and mutably at the same time.
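To illustrate the borrow conflict, here is a sketch with a hypothetical `Comm` whose `all_reduce` has the same read-from/write-into shape described above, using host slices of `f32` instead of a real `CudaSlice`. Passing the same buffer as both arguments is rejected at compile time (error E0502), so without an in-place variant the caller has to copy the input first, which costs an extra allocation and copy.

```rust
// Hypothetical stand-in for cudarc's Comm; host slices instead of CudaSlice.
struct Comm {
    peer: Vec<f32>, // stands in for the other ranks' contributions
}

impl Comm {
    // Same shape as the signature discussed above: read from `src`, write into `dst`.
    fn all_reduce(&self, src: &[f32], dst: &mut [f32]) {
        for ((d, s), p) in dst.iter_mut().zip(src).zip(&self.peer) {
            *d = s + p;
        }
    }
}

fn main() {
    let comm = Comm { peer: vec![1.0, 2.0, 3.0] };
    let mut buf = vec![10.0_f32, 20.0, 30.0];

    // Does not compile: `buf` cannot be borrowed as mutable because it is
    // also borrowed as immutable (E0502).
    // comm.all_reduce(&buf, &mut buf);

    // Workaround: copy the input so the two borrows refer to different buffers.
    let src = buf.clone();
    comm.all_reduce(&src, &mut buf);
    println!("{:?}", buf); // [11.0, 22.0, 33.0]
}
```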