-
While running the multi-GPU PyTorch tests, `test_all_reduce_coalesced_nccl` fails in `pytorch/test/test_c10d_nccl.py`. The error seems to be caused by inconsistent results from allreduce…
-
### 🐛 Describe the bug
When I try to fine-tune with DDP ([LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)) in WSL2 (Windows 10 host), I get this error:
```
DESKTOP-VMBL43V:1354:1354 [0] NCCL INFO …
```
-
```shell
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
G…
```
-
I am using the `mpirun` command to test the `all_reduce_perf` binary of nccl-tests on two servers within the same local area network. I am able to run other files normally with the `mpirun` command, but w…
-
### Expected Behavior
Updating ComfyUI does not install NVIDIA dependencies on AMD systems.
### Actual Behavior
Updating ComfyUI installs NVIDIA dependencies on AMD systems.
### Steps to Reproduce
…
-
In https://github.com/coreylowman/cudarc/blob/886d6d27cd68da4f81ce30a98bdf1940a895f813/src/nccl/safe.rs#L242-L245, the sendbuf is given as an `&Option`, which involves wrapping the `T`, which is somet…
-
Hi, I was recently using NCCL MNNVL, and the documentation says that an IMEX channel is needed to generate a fabric handle. What is an IMEX channel, and what is its relationship with the …
-
I think that NCCL is part of PyTorch? I am running Python 3.9, so I had to install torch using `-c=conda-forge`, as specified in the instructions for installing torch. It seemed to install correctl…
-
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180594101/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
-
Ray NCCL collectives fail allreduce on multi-GPU aws.G5 nodes because of an issue with how the node exposes topology information. The workaround is to apply `NCCL_P2P_DISABLE=1`, but this negatively i…
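The workaround named in the last snippet amounts to setting an environment variable before NCCL initializes. A minimal sketch, assuming the variable only needs to be present in the environment of each worker process; the helper name `nccl_env_with_p2p_disabled` is hypothetical and not part of Ray or NCCL:

```python
import os


def nccl_env_with_p2p_disabled(base_env=None):
    """Return a copy of the environment with NCCL peer-to-peer transport disabled.

    `base_env` is a hypothetical parameter for illustration; when omitted,
    the current process environment is copied.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NCCL reads this variable at initialization, so it has to be set
    # before the first communicator is created in the worker process.
    env["NCCL_P2P_DISABLE"] = "1"
    return env


# Build an environment for a worker process without mutating os.environ.
env = nccl_env_with_p2p_disabled({"PATH": "/usr/bin"})
print(env["NCCL_P2P_DISABLE"])
```

How the resulting mapping reaches the workers depends on the launcher: it could be passed as `env=` to `subprocess.Popen`, or, in Ray, supplied through the `env_vars` entry of a `runtime_env`. Which of these applies to the setup in the issue is not stated in the snippet.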