-
The following error occurred when I used OpenMPI to run nccl-tests across multiple machines (two nodes):
![image](https://github.com/NVIDIA/nccl-tests/assets/79404809/0a378553-9b5c-4cce-86d7-777400193a…
-
In our SLURM cluster we use dual-attached servers connected by L3 BGP unnumbered (with FRR "BGP to the host") via the lan0 and lan1 interfaces (ECMP; see the routing table, which is really simple, as everythi…
-
While running the multi-GPU PyTorch tests, test_all_reduce_coalesced_nccl fails in pytorch/test/test_c10d_nccl.py. The error appears to come from inconsistent results from allreduce…
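For context, the invariant that a coalesced all-reduce test checks can be sketched in plain Python (a simulation of the collective's semantics, not NCCL or the PyTorch test itself): every rank contributes a list of tensors, and afterwards every rank must hold the element-wise sum of each tensor across all ranks.

```python
# Plain-Python sketch of coalesced all-reduce semantics (a simulation, not
# NCCL): per_rank_tensors[r][t] is tensor t on rank r, modeled as a list of
# numbers. After the collective, every rank holds the same element-wise sums.

def allreduce_coalesced(per_rank_tensors):
    """Return the tensor list every rank should observe afterwards:
    result[t] is the element-wise sum of tensor t across all ranks."""
    num_tensors = len(per_rank_tensors[0])
    reduced = []
    for t in range(num_tensors):
        # Element-wise sum of tensor t across all ranks.
        summed = [sum(vals) for vals in zip(*(rank[t] for rank in per_rank_tensors))]
        reduced.append(summed)
    return reduced

if __name__ == "__main__":
    ranks = [
        [[1, 2], [10]],   # rank 0 contributes two tensors
        [[3, 4], [20]],   # rank 1 contributes two tensors
    ]
    print(allreduce_coalesced(ranks))  # [[4, 6], [30]]
```

"Inconsistent results" in the report means different ranks ended up with different values after the collective, violating this invariant.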
-
Hi! I wanted to surface a quirk of running jax on an AWS GPU cluster in case it's helpful — jax's vendored NCCL doesn't play well with AWS Libfabric EFA due to the way that jax starts processes (issue…
-
### What happened + What you expected to happen
I'm running the online DPO code on multiple nodes in MegatronLM. There are three nodes in total. Among them, four cards are allocated for the actor mod…
-
Hi 👋 ,
When trying to run any NCCL application, it always seems to hang when running on more than 2 GPUs (see attached logs with `NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL`).
The command is ex…
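For reference, the debug settings quoted above (`NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`) are real NCCL environment variables and must be present in every rank's environment before NCCL initializes. A minimal Python launcher sketch (the `mpirun` invocation is a hypothetical placeholder):

```python
import os
import subprocess

# Sketch: export NCCL debug settings into the environment that every rank
# inherits. NCCL reads these at initialization time, so setting them after
# the first collective has no effect.
env = dict(os.environ)
env.update({
    "NCCL_DEBUG": "TRACE",        # maximum verbosity
    "NCCL_DEBUG_SUBSYS": "ALL",   # log all subsystems (INIT, NET, GRAPH, ...)
})

# Hypothetical launch command; substitute the actual one from the report:
# subprocess.run(["mpirun", "-np", "4", "./my_nccl_app"], env=env, check=True)

print(env["NCCL_DEBUG"], env["NCCL_DEBUG_SUBSYS"])  # TRACE ALL
```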
-
We have observed a significant performance degradation in the alltoall operation when using NCCL versions 2.19 and 2.20 compared to version 2.18.
**System Configuration:**
Max Nodes: 8
Machine Typ…
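For readers unfamiliar with the operation whose throughput regressed: alltoall is a full exchange in which rank i's input is split into n chunks and chunk j is delivered to rank j. A plain-Python simulation of the semantics (not NCCL itself):

```python
# Plain-Python simulation of alltoall semantics (not NCCL): with n ranks,
# inputs[i][j] is the chunk rank i sends to rank j, and afterwards
# outputs[j][i] == inputs[i][j] -- effectively a transpose of the chunk grid.

def alltoall(inputs):
    """inputs[i][j] = chunk rank i sends to rank j.
    Returns outputs with outputs[j][i] = inputs[i][j]."""
    n = len(inputs)
    return [[inputs[i][j] for i in range(n)] for j in range(n)]

if __name__ == "__main__":
    inp = [["a0", "a1"], ["b0", "b1"]]  # 2 ranks, 2 chunks each
    print(alltoall(inp))  # [['a0', 'b0'], ['a1', 'b1']]
```

Because every rank exchanges data with every other rank, alltoall is especially sensitive to per-peer transport and channel tuning, which is why a version-to-version regression shows up so clearly here.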
-
When I use the dino config to run a test with PyTorch 1.13 + mmcv 2.0.0, I get this error:
-
[0] NCCL INFO cudaDriverVersion 11040
[0] NCCL INFO Bootstrap : Using eth0:10.84.253.70
[0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
File "/usr/loc…
-
Hi.
I have been running nccl-tests in a multi-node, multi-GPU environment with NCCL 2.19.3-1 and OpenMPI 4.1.6. Each node has 4 NVIDIA V100 GPUs interconnected with NVLink and PCIe.
1. How is th…