visatish opened this issue 2 months ago
@bwbarrett I noticed you had helped with some related issues.
Hi, can you please try enabling all 4 EFAs?
@AmedeoSapio I was actually able to get it working with the native PyTorch version in the AMI (i.e. conda activate pytorch):
(head, rank=0, pid=36870) [ip-172-31-40-103:0]:The average bandwidth of all_reduce with a 4.0GB payload (5 trials, 16 ranks):
(head, rank=0, pid=36870) [ip-172-31-40-103:0]: algbw: 11.135 GBps (89.1 Gbps)
(head, rank=0, pid=36870) [ip-172-31-40-103:0]: busbw: 20.878 GBps (167.0 Gbps)
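As a sanity check on those numbers: busbw here is just algbw scaled by the standard all-reduce correction factor 2*(n-1)/n (the same convention nccl-tests uses), which with 16 ranks works out as:

    # busbw = algbw * 2 * (n - 1) / n for all-reduce (nccl-tests convention)
    n_ranks = 16
    algbw = 11.135                               # GBps, as reported above
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    print(f"busbw = {busbw:.3f} GBps")           # ~20.878 GBps, matching the report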
I will try with 4 NICs, but presumably that will just increase bandwidth.
This hints at some incompatibility between aws-ofi-nccl and the latest torch + torch deps (I have updated the original issue to note that I was installing the latest torch fresh, i.e. pip install torch, before running the commands).
Hello. There is a known incompatibility between NCCL 2.19+ and Libfabric from EFA installers before 1.29. I'm guessing using the latest PyTorch will upgrade the NCCL version.
Workarounds are any of the following:
1. Set FI_EFA_SET_CUDA_SYNC_MEMOPS=0 in the environment.
2. Update to EFA installer 1.29 or later, which ships a Libfabric (and efa.ko) that includes the fix.
Hi @rauteric, good to know! Is there any significant performance downside to (1), as that would be the least invasive for our stack atm?
No, this setting merely prevents Libfabric from setting a property on a CUDA buffer (sync_memops) that is not needed for NCCL. It shouldn't have any performance impact.
Gotcha, confirmed that FI_EFA_SET_CUDA_SYNC_MEMOPS=0 works with the latest PyTorch + NCCL stack in the original example.
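For reference, a minimal sketch of applying that workaround from inside a PyTorch entry point (the variable only needs to be in the process environment before the first NCCL communicator is created, so exporting it in the shell or job script works just as well; the snippet below is illustrative, not code from this thread):

    import os

    # Workaround (1): stop Libfabric from setting the sync_memops property on
    # CUDA buffers (not needed for NCCL, per the maintainers above). Must be in
    # the environment before NCCL/Libfabric initialize.
    os.environ["FI_EFA_SET_CUDA_SYNC_MEMOPS"] = "0"

    import torch
    import torch.distributed as dist

    # Optional sanity check: the incompatibility only affects NCCL 2.19+.
    print("NCCL version:", torch.cuda.nccl.version())

    dist.init_process_group(backend="nccl")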
It might be nice for future users to "pin" this in some fashion under "Known problems/limitations" in an easy-to-find place, or to keep an up-to-date compatibility chart. But for now, I guess it's indexed in this ticket :)
Thanks again for the help!
For future searchers, if it's at all possible, please do prefer to update efa.ko and libfabric instead of relying on this environment variable -- this specific workaround doesn't come with a perf hit, but you are missing out on other performance improvements and bug fixes by using older versions, and you should update whenever you can.
@visatish we've documented a bunch of these EFA/NCCL-related failure modes in the awsome-distributed-training repo, e.g. https://github.com/aws-samples/awsome-distributed-training/issues/203
Hi,
I'm trying to run an NCCL all-reduce benchmark on AWS EC2 and am running into the following error:
Setup:
2x p4d.24xlarge
"Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)" AMI
Relevant libs (note that I have installed the latest torch 2.4.1 & deps fresh):
torch-2.4.1-cp310-cp310-manylinux1_x86_64.whl
nvidia-nccl-cu12==2.20.5
Single EFA-enabled NIC (note that I know this instance type can support up to 4x, but I'm starting with 1):
Cmd:
From https://github.com/stas00/ml-engineering.git:
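The exact script and invocation aren't shown here. As an illustration only (not the actual code from that repo, and the file name below is hypothetical), a minimal all-reduce bandwidth benchmark of this kind, launched with torchrun on each node, looks roughly like:

    # Illustrative sketch only, not the ml-engineering script.
    # Launch on each node with e.g.:
    #   torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d \
    #            --rdzv_endpoint=<head-ip>:29500 all_reduce_sketch.py
    import os
    import time

    import torch
    import torch.distributed as dist

    PAYLOAD_GB = 4.0   # size of the tensor being all-reduced
    TRIALS = 5

    def main():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        world = dist.get_world_size()
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        numel = int(PAYLOAD_GB * 2**30) // 4          # fp32 elements for ~4 GB
        x = torch.ones(numel, dtype=torch.float32, device="cuda")

        dist.all_reduce(x)                            # warm-up
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(TRIALS):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / TRIALS

        algbw = PAYLOAD_GB / elapsed                  # GB moved per rank per second
        busbw = algbw * 2 * (world - 1) / world       # all-reduce correction factor
        if rank == 0:
            print(f"{world} ranks: algbw {algbw:.3f} GBps, busbw {busbw:.3f} GBps")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()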
Output:
nccl_out.txt
Note this particular portion:
I'm not quite sure what "Error: Invalid argument" could be - any help is appreciated. Thanks!
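For future readers hitting a similarly generic error, one low-effort first step (a general suggestion, not something from this thread) is to turn up NCCL and Libfabric logging before rerunning the benchmark, either by exporting the variables in the shell or, as sketched here, at the top of the entry point:

    import os

    # Enable verbose logging from NCCL and Libfabric/EFA before any distributed
    # initialization, so the failing call is reported with more context.
    os.environ["NCCL_DEBUG"] = "INFO"      # NCCL init/transport logging
    os.environ["FI_LOG_LEVEL"] = "warn"    # Libfabric provider warnings (EFA included)

    import torch.distributed as dist
    dist.init_process_group(backend="nccl")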