aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0
147 stars 56 forks source link

Add Multiplexed-round-robin scheduler #604

Closed arunkarthik-akkart closed 1 month ago

arunkarthik-akkart commented 2 months ago

Description of changes: This change modifies the current scheduling policy such that we either round-robin on each rail or a pair of rails or a triplet of rails or a quadruplet of rails based on the min_stripe_size. Within the group of rails the message is divided by the stripe_size and multiplexed on each rail.

The num_stripes to be multiplexed is calculated using the below formula, num_stripes = min ( ceil ( msg_size / min_stripe_size ) , num_rails )

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

AmedeoSapio commented 2 months ago

The CI failure is a real issue. We need to figure out how this can happen. gather_perf: nccl_ofi_rdma.c:223: get_send_comm_rail: Assertion rail_id < s_comm->num_init_rails failed.

aws-nslick commented 1 month ago

after passing every other box, the runner lost network, couldn't query pypi, and timed out on pip install ./pf.

Command . /home/ubuntu/PortaFiducia/env/bin/activate; pip install -Ue /home/ubuntu/PortaFiducia failed with error:
  error: subprocess-exited-with-error
[...]
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7cf6edc4cdc0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/setuptools/
ERROR: Could not find a version that satisfies the requirement setuptools>=40.8.0 (from versions: none)
ERROR: No matching distribution found for setuptools>=40.8.0

bot:aws:retest

aws-nslick commented 1 month ago

I rebased #627 on this and did a build: link

-Wsign-compare does not like this line:

../include/nccl_ofi_math.h:29:46: error: signed and unsigned type in conditional expression [-Werror=sign-compare]
 #define  NCCL_OFI_MIN(x, y) ((x) < (y) ? (x) : (y))
                                              ^
../include/nccl_ofi_math.h:24:42: note: in definition of macro 'NCCL_OFI_MAX'
 #define NCCL_OFI_MAX(x, y) ((x) < (y) ? (y) : (x))
                                          ^
nccl_ofi_scheduler.c:45:42: note: in expansion of macro 'NCCL_OFI_MIN'
  int num_stripes = (int) NCCL_OFI_MAX(1, NCCL_OFI_MIN(NCCL_OFI_DIV_CEIL(size, scheduler->min_stripe_size), num_rails));

I was going to suggest making num_rails unsigned in the signature, but if you follow this all the way out, it ultimately comes from just matching the choice of the type on device->num_rails... int is probably the wrong type there (I don't know what it would mean for a device to have a negative amount of rails) but changing something so far away is a pretty intrusive change to ask for at this point in this PR.

If this passes jenkins on this go around, please push and I'll rebase on it and amend the commit for #575 with a fixup. (I'd like to drop #576 anyway.) Otherwise, if you end up having to rebase on #627, please do (unsigned)num_rails there or similar.