[Open] ZYHowell opened this issue 1 year ago
cc @jiaodong
Good day,
I am currently working on this bug.
@AhmedMAlbreiki Please submit a PR so we can help review, thanks!
Hello, I'm helping out with this issue, and I have some questions about it.
Currently the rank is computed in `_get_nccl_collective_communicator`, where it is set like so: `actual_rank = self.rank * len(device_list) + i`. Is this the issue in question, i.e. does this need to be replaced by `start_gpu_rank = something magical`?
Still trying to fully understand the issue, thanks
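To make my question concrete, here is a rough sketch (made-up numbers, not the actual `_get_nccl_collective_communicator` code) of what I think that formula computes, assuming `self.rank` is this node's rank in the group and `device_list` is the GPUs on this node:

```python
# Sketch of what actual_rank = self.rank * len(device_list) + i computes
# (hypothetical values; rank stands in for self.rank, device_list for the
# GPUs owned by *this* node, and i indexes device_list).
rank, device_list = 1, [0, 1, 2, 3]  # e.g. the second node, owning 4 GPUs
print([rank * len(device_list) + i for i in range(len(device_list))])  # [4, 5, 6, 7]
# This is only correct if every other node in the group also owns
# len(device_list) GPUs, which is the situation the Background below describes.
```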
Background
Alpa initializes collective groups for each cross-mesh communication pair. The call stack to initialize a collective group is:
`create_collective_group` or `init_collective_group` from `collective.py`
calls: `create_collective_group` of the `GroupManager` class in `collective.py`
calls: `NCCLGroup.__init__`, which has two different implementations: one is based on cupy, while the other is based on xla.
A `NCCLGroup` creates and manages NCCL communicators for each GPU on its node. When we need to call an NCCL function, we eventually go into the `NCCLGroup` to call it. However, in the current implementation we use `node_rank * num_devices_per_node + local_offset` to compute the rank of a local GPU w.r.t. the communication group (an example is here). This is correct in most cases, but when the send mesh has a different number of devices per node than the receive mesh, it is incorrect.
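For concreteness, here is a minimal numeric sketch (hypothetical mesh shapes, not Alpa code) of how that formula can collide when the two meshes differ:

```python
# Hypothetical group: send mesh = 2 nodes x 4 GPUs, receive mesh = 1 node x 2 GPUs,
# so the communication group has 10 GPUs and the correct ranks are 0..9.

def current_rank(node_rank, num_devices_per_node, local_offset):
    # The formula used today; num_devices_per_node is the device count of *this* node.
    return node_rank * num_devices_per_node + local_offset

# Send mesh: node ranks 0 and 1, 4 GPUs each -> ranks 0..7, which is correct.
send_ranks = [current_rank(n, 4, i) for n in range(2) for i in range(4)]
print(send_ranks)  # [0, 1, 2, 3, 4, 5, 6, 7]

# Receive mesh: node rank 2 in the group, but it only owns 2 GPUs.
# Its GPUs should get ranks 8 and 9, yet the formula yields ranks that
# collide with GPUs of the send mesh.
recv_ranks = [current_rank(2, 2, i) for i in range(2)]
print(recv_ranks)  # [4, 5] -- wrong; should be [8, 9]
```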
TODO: pass a `start_gpu_rank` at the initialization of `NCCLGroup`.
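One possible shape of the fix, sketched under the assumption that the caller can compute the first global GPU rank each node owns while walking over the send and receive meshes (the class and parameter names below are illustrative, not the actual `NCCLGroup` API, apart from `start_gpu_rank` itself):

```python
# Sketch only: a simplified NCCLGroup-like class that accepts start_gpu_rank at
# initialization instead of deriving ranks from node_rank * num_devices_per_node.

class NcclGroupSketch:
    def __init__(self, world_size, device_list, start_gpu_rank):
        # world_size:     total number of GPUs in the communication group
        # device_list:    local GPU ids managed by this node
        # start_gpu_rank: global rank of this node's first GPU within the group,
        #                 precomputed by the caller from both meshes' layouts
        self.world_size = world_size
        self.device_list = device_list
        self.start_gpu_rank = start_gpu_rank

    def actual_rank(self, local_offset):
        # Correct regardless of how many GPUs the other nodes own.
        return self.start_gpu_rank + local_offset


# Continuing the example above: the receive node owns 2 of the group's 10 GPUs,
# and its first GPU should be global rank 8, so the caller passes start_gpu_rank=8.
recv_group = NcclGroupSketch(world_size=10, device_list=[0, 1], start_gpu_rank=8)
print([recv_group.actual_rank(i) for i in range(2)])  # [8, 9]
```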