NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Add bisection test #203

Open x41lakazam opened 4 months ago

x41lakazam commented 4 months ago

In a bisection test, each rank is paired with one other rank, called its 'peer'. Both ranks send and receive N messages, where N is set by the agg_iters parameter.

The selection of this 'peer' rank is defined by the getPeer function: https://github.com/x41lakazam/nccl-tests/blob/bisection_test/src/bisection.cu#L19
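
For readers who have not opened the branch, the pairing idea can be sketched roughly as follows. This is a hypothetical illustration, not the getPeer function from the linked file: it assumes that a rank in the lower half is paired with the rank half the communicator size above it, and that when nranks is odd the last rank has no partner and simply returns itself. The name bisectionPeer is made up for this sketch.

```cpp
// Hypothetical peer selection for a bisection test (NOT the PR's getPeer).
// Assumption: rank r in the lower half pairs with r + nranks/2, and vice versa.
// Assumption: when nranks is odd, the last rank has no partner and returns itself.
int bisectionPeer(int rank, int nranks) {
  const int half = nranks / 2;
  if (rank < half) return rank + half;      // lower half -> upper half
  if (rank < 2 * half) return rank - half;  // upper half -> lower half
  return rank;                              // odd leftover rank stays idle
}
```

Under these assumptions the mapping is symmetric (the peer of a rank's peer is the rank itself) and it does not require nranks to be a power of two.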

sjeaugey commented 4 months ago

Can't you already do that by running the sendrecv_perf test and setting NCCL_TESTS_SPLIT_MASK=(nranks/2)-1?

Granted, that only works when nranks is a power of two; maybe your code is more generic.
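
To illustrate the power-of-two limitation, here is a small sketch. It assumes, as the nccl-tests README describes, that NCCL_TESTS_SPLIT_MASK gives each rank the color rank & mask and that ranks sharing a color end up in the same communicator. The helper name groupsByColor is hypothetical and exists only for this illustration.

```cpp
#include <map>
#include <vector>

// Group ranks by color = rank & mask (assumed behavior of NCCL_TESTS_SPLIT_MASK).
// groupsByColor is a hypothetical helper for illustration only.
std::map<int, std::vector<int>> groupsByColor(int nranks) {
  const int mask = nranks / 2 - 1;  // the mask suggested above
  std::map<int, std::vector<int>> groups;
  for (int rank = 0; rank < nranks; ++rank) {
    groups[rank & mask].push_back(rank);  // same color -> same communicator
  }
  return groups;
}
```

With nranks = 8 the mask is 3 and the groups are {0,4}, {1,5}, {2,6}, {3,7}, one pair per color. With nranks = 6 the mask is 2, which yields {0,1,4,5} and {2,3}, so the ranks are no longer split into pairs.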

x41lakazam commented 4 months ago

Being able to run this test when nranks is not a power of two is actually important to us.

x41lakazam commented 3 months ago

@AddyLaddy @sjeaugey

The code is ready to merge from our side. Please let me know what you think about it, thanks.