Closed arunkarthik-akkart closed 1 month ago
The CI failure is a real issue. We need to figure out how this can happen.
gather_perf: nccl_ofi_rdma.c:223: get_send_comm_rail: Assertion rail_id < s_comm->num_init_rails failed.
after passing every other box, the runner lost network, couldn't query pypi, and timed out on pip install ./pf
.
Command . /home/ubuntu/PortaFiducia/env/bin/activate; pip install -Ue /home/ubuntu/PortaFiducia failed with error:
error: subprocess-exited-with-error
[...]
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7cf6edc4cdc0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/setuptools/
ERROR: Could not find a version that satisfies the requirement setuptools>=40.8.0 (from versions: none)
ERROR: No matching distribution found for setuptools>=40.8.0
bot:aws:retest
I rebased #627 on this and did a build: link
-Wsign-compare
does not like this line:
../include/nccl_ofi_math.h:29:46: error: signed and unsigned type in conditional expression [-Werror=sign-compare]
#define NCCL_OFI_MIN(x, y) ((x) < (y) ? (x) : (y))
^
../include/nccl_ofi_math.h:24:42: note: in definition of macro 'NCCL_OFI_MAX'
#define NCCL_OFI_MAX(x, y) ((x) < (y) ? (y) : (x))
^
nccl_ofi_scheduler.c:45:42: note: in expansion of macro 'NCCL_OFI_MIN'
int num_stripes = (int) NCCL_OFI_MAX(1, NCCL_OFI_MIN(NCCL_OFI_DIV_CEIL(size, scheduler->min_stripe_size), num_rails));
I was going to suggest making num_rails unsigned in the signature, but if you follow this all the way out, it ultimately comes from just matching the choice of the type on device->num_rails
... int
is probably the wrong type there (I don't know what it would mean for a device to have a negative amount of rails) but changing something so far away is a pretty intrusive change to ask for at this point in this PR.
If this passes jenkins on this go around, please push and I'll rebase on it and amend the commit for #575 with a fixup. (I'd like to drop #576 anyway.) Otherwise, if you end up having to rebase on #627, please do (unsigned)num_rails
there or similar.
Description of changes: This change modifies the current scheduling policy such that we either round-robin on each rail or a pair of rails or a triplet of rails or a quadruplet of rails based on the min_stripe_size. Within the group of rails the message is divided by the stripe_size and multiplexed on each rail.
The num_stripes to be multiplexed is calculated using the below formula, num_stripes = min ( ceil ( msg_size / min_stripe_size ) , num_rails )
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.