dphow opened 2 months ago
Hello! These numbers look broadly reasonable and actually a tiny bit better than what we observe on Perlmutter (e.g. 2-node, 8 GPU allreduce at large message sizes saturates around 76 GB/s of busbw for us).
It's interesting that it is working for you using the `1.7.4-aws` branch of the NCCL plugin -- if you check the release notes for the releases tagged with `-aws`, they say:

> This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future.

As such we've avoided using them and stuck with `1.6.0`, which is the most recent non-`-aws` tag they've released. I guess the `-aws` branches are more broadly compatible than that message might indicate, but also maybe any potential problems aren't getting rooted out by simple `nccl-tests`. Have you tried any more complex (e.g. deep learning training) workloads on your system with this stack?
I can add that with the `1.6.0` NCCL plugin we've recently had some success with `pytorch`. We've also compiled it from source with some very minimal patches so that we can use NCCL+AWS-OFI but also use the MPI backend and run with `cray-mpich`. That all holds together, with MPI and NCCL performance pretty comparable. We did not try that with any of the newer `-aws` tags, though.
> We've also compiled it from source with some very minimal patches so that we can use NCCL+AWS-OFI but also use the MPI backend and run with `cray-mpich`
Can you share more detail about what needed to be patched in your case?
Sure can - took a while to get all the build dependencies satisfied on our system, but once we did I was dismayed to see this runtime error:
```
[rank0]:   File "/glade/work/benkirk/conda-envs/pytorch-mpi-derecho-gcc-12.2.0-cray-mpich-8.1.27/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA tensor detected and the MPI used doesn't have CUDA-aware MPI support
```
`git grep` and following that message led me to one or two files that need patching, depending on the pytorch version:
- `caffe2/mpi/mpi_ops_gpu.cc` (gone in pytorch-v2.4.0)
- `torch/csrc/distributed/c10d/ProcessGroupMPI.cpp`
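For context (paraphrasing from memory of a recent pytorch tree, so the exact lines vary by version), the Open MPI-only include and the check that produces the error above look roughly like this in `ProcessGroupMPI.cpp`:

```cpp
// Approximate excerpt, not verbatim: mpi-ext.h (which provides
// MPIX_Query_cuda_support) is only pulled in when building against Open MPI.
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h> // Needed for CUDA-aware check
#endif

// Each collective funnels its input tensors through a helper along these
// lines, which is where the RuntimeError above comes from when
// cudaAwareMpiCheck() (see below) reports no CUDA support:
void checkSingleTensorHelper(const at::Tensor& tensor) {
  if (tensor.is_cuda() && !cudaAwareMpiCheck()) {
    TORCH_CHECK(
        false,
        "CUDA tensor detected and the MPI used doesn't have CUDA-aware MPI support");
  }
}
```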
The root cause is that all the CUDA-awareness is wrapped in `#if defined(OPEN_MPI) && OPEN_MPI` guards, and the fallback is to assume that whatever MPI is being used is not CUDA-aware. So on Derecho with `cray-mpich` I simply hard-coded the fallbacks to assume the opposite, i.e. CUDA-awareness.
https://github.com/benkirk/derecho-pytorch-mpi/tree/main/patches
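To give a flavor of what "hard-coding the fallback" means (an illustrative sketch only -- the actual patches are in the repo linked above), the idea is to flip the fallback in `cudaAwareMpiCheck()`:

```cpp
// Illustrative sketch of hard-coding the fallback, not a verbatim patch.
bool cudaAwareMpiCheck() {
#if defined(MPIX_CUDA_AWARE_SUPPORT)
  // Open MPI path: mpi-ext.h lets us query CUDA support at run time.
  return MPIX_Query_cuda_support() == 1;
#else
  // Upstream fallback: assume the MPI in use is NOT CUDA-aware.
  //   return false;
  // Patched fallback for a CUDA-aware cray-mpich build: assume that it is.
  return true;
#endif
}
```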
This is all a work-in-progress, but we've had success running it. Ultimately I'd like to turn those patches into a PR for pytorch so you can define the fallback assumptions with a configure argument or something for non-OpenMPI builds.
Ah, right, thanks! I build pytorch with cray-mpich support but never bothered trying to enable CUDA-awareness because of those issues, and figured NCCL would be the more performant option anyway. It's cool to know that it can be made to work, and it's interesting to hear you saw comparable performance. If you have any performance results to share I'd be interested to see them. If you manage to upstream the "fix", let us know -- that'd be a nice pytorch contribution. Thanks again for sharing.
Hello NERSC Team,
HPCD CISL staff at NSF NCAR would like to request a performance comparison between your use of this nccl-ofi-plugin and ours, as copied below. We used a test suite that expands a bit on your test runs, found in the GitHub fork by @benkirk -- please let us know if this is something you can do and turn around within a couple of weeks.
Primarily, we would like to compare the settings in https://github.com/benkirk/nccl-ofi-plugin/blob/main/env_nccl_derecho.sh in order to determine what should be optimal.
Note that the tests below used NCCL 2.22.3-1 and AWS NCCL plugin 1.7.4. I am comfortable using the latest NCCL version, but suspect it would be difficult to adopt AWS plugin versions beyond 1.7.4 given their specific targeting of AWS machines after that version.
Avg bus bandwidth (GB/s) per test suite run:

| Test | Intra-node (2 GPUs) | Intra-node (4 GPUs) | Inter-node (2 GPUs) | Inter-node (4 GPUs) | Inter-node (8 GPUs) |
| --- | --- | --- | --- | --- | --- |
| all-gather | 45.3614 | 127.491 | 14.0373 | 29.4738 | 50.7769 |
| all-reduce | 57.1399 | 154.882 | 16.6409 | 29.9176 | 51.8871 |
| all-to-all | 49.2425 | 135.873 | 13.1876 | 18.9097 | 21.0073 |
| send-recv | 56.3249 | 63.7496 | 14.6531 | 17.0042 | 16.3461 |
Thanks!