[Open] ZYHowell opened this issue 1 year ago
cc @jiaodong
Good day,
I am currently working on this bug.
@AhmedMAlbreiki Please submit a PR so we can help review, thanks!
Hello, I'm helping out with this issue, and I have some questions about it.
Currently the rank is computed in `_get_nccl_collective_communicator`, where it is set like so: `actual_rank = self.rank * len(device_list) + i`. Is this the issue in question, i.e. does this need to be replaced by `start_gpu_rank = something magical`?
Still trying to fully understand the issue, thanks
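To make my question concrete, here is a rough sketch (made-up numbers, not the actual `_get_nccl_collective_communicator` code) of what I think that formula computes, assuming `self.rank` is this node's rank in the group and `device_list` is the GPUs on this node:

```python
# Sketch of what actual_rank = self.rank * len(device_list) + i computes
# (hypothetical values; rank stands in for self.rank, device_list for the
# GPUs owned by *this* node, and i indexes device_list).
rank, device_list = 1, [0, 1, 2, 3]  # e.g. the second node, owning 4 GPUs
print([rank * len(device_list) + i for i in range(len(device_list))])  # [4, 5, 6, 7]
# This is only correct if every other node in the group also owns
# len(device_list) GPUs, which is the situation the Background below describes.
```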
Background
Alpa initializes collective groups for each cross-mesh communication pair. The call stack to initialize a collective group is:
`create_collective_group` or `init_collective_group` from `collective.py`
calls: `create_collective_group` of the `GroupManager` class in `collective.py`
calls: `NCCLGroup.__init__`, which has two different implementations: one is based on cupy, while the other is based on xla.
A `NCCLGroup` creates and manages NCCL communicators for each GPU on its node. When we need to call an NCCL function, we eventually go into the `NCCLGroup` to call it. However, in the current implementation we use `node_rank * num_devices_per_node + local_offset` to compute the rank of a local GPU w.r.t. the communication group (an example is here). This is correct in most cases, but when the send mesh has a different number of devices per node than the receive mesh, it is incorrect.
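For concreteness, here is a minimal numeric sketch (hypothetical mesh shapes, not Alpa code) of how that formula can collide when the two meshes differ:

```python
# Hypothetical group: send mesh = 2 nodes x 4 GPUs, receive mesh = 1 node x 2 GPUs,
# so the communication group has 10 GPUs and the correct ranks are 0..9.

def current_rank(node_rank, num_devices_per_node, local_offset):
    # The formula used today; num_devices_per_node is the device count of *this* node.
    return node_rank * num_devices_per_node + local_offset

# Send mesh: node ranks 0 and 1, 4 GPUs each -> ranks 0..7, which is correct.
send_ranks = [current_rank(n, 4, i) for n in range(2) for i in range(4)]
print(send_ranks)  # [0, 1, 2, 3, 4, 5, 6, 7]

# Receive mesh: node rank 2 in the group, but it only owns 2 GPUs.
# Its GPUs should get ranks 8 and 9, yet the formula yields ranks that
# collide with GPUs of the send mesh.
recv_ranks = [current_rank(2, 2, i) for i in range(2)]
print(recv_ranks)  # [4, 5] -- wrong; should be [8, 9]
```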
TODO: pass a `start_gpu_rank` at the initialization of `NCCLGroup`.
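One possible shape of the fix, sketched under the assumption that the caller can compute the first global GPU rank each node owns while walking over the send and receive meshes (the class and parameter names below are illustrative, not the actual `NCCLGroup` API, apart from `start_gpu_rank` itself):

```python
# Sketch only: a simplified NCCLGroup-like class that accepts start_gpu_rank at
# initialization instead of deriving ranks from node_rank * num_devices_per_node.

class NcclGroupSketch:
    def __init__(self, world_size, device_list, start_gpu_rank):
        # world_size:     total number of GPUs in the communication group
        # device_list:    local GPU ids managed by this node
        # start_gpu_rank: global rank of this node's first GPU within the group,
        #                 precomputed by the caller from both meshes' layouts
        self.world_size = world_size
        self.device_list = device_list
        self.start_gpu_rank = start_gpu_rank

    def actual_rank(self, local_offset):
        # Correct regardless of how many GPUs the other nodes own.
        return self.start_gpu_rank + local_offset


# Continuing the example above: the receive node owns 2 of the group's 10 GPUs,
# and its first GPU should be global rank 8, so the caller passes start_gpu_rank=8.
recv_group = NcclGroupSketch(world_size=10, device_list=[0, 1], start_gpu_rank=8)
print([recv_group.actual_rank(i) for i in range(2)])  # [8, 9]
```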