-
### 🚀 The feature, motivation and pitch
While looking at sources of performance variability for multi-node training jobs, we have found that one mechanism is associated with activity of the pytorch N…
-
Hi NCCL experts,
I was recently investigating some of the output of `NCCL_DEBUG_SUBSYS=Graph` with the latest NCCL version (NCCL 2.23).
I was specifically looking at some of the output that `ncclTop…
-
Hi, I seriously need your help.
First, the NVIDIA HW/SW components for intra-node and inter-node communication are compatible and enabled, and working well.
NVSwitch (NVLink, Fabric Manager), GPU (…
-
When using rccl rdma sharp plugin, I encountered a program crash with the following log:
```
[root@node01 ~]# mpirun \
> -np 2 \
> --oversubscribe \
> --allow-run-as-root \
> -H n…
-
### What happened + What you expected to happen
```
from time import perf_counter
from time import sleep
from contextlib import contextmanager
from typing import Callable
STATIC_SHAPE = False
…
-
### 🐛 Describe the bug
```python
from torch.distributed._tensor import Replicate, Shard, distribute_tensor, init_device_mesh
import torch
from torch import distributed as dist
if __name__ == …
-
I tried running `examples/torch_ddp_benchmark` on Kubernetes, but the task hangs with the following error until an NCCL timeout is thrown. It might be related to this [issue](https://github.com/pyt…
-
I consulted the NCCL documentation and found that by using NCCL_ALGO and NCCL_PROTO, I can specify the algorithm and protocol used when running NCCL. For example, -x NCCL_ALGO=Ring -x NCCL_PROTO=LL in…
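Besides passing `-x NCCL_ALGO=Ring -x NCCL_PROTO=LL` through `mpirun`, the same selection can be made in-process. A minimal sketch, assuming a PyTorch launcher script where the variables are exported before the NCCL communicator is created (the variable names come from the NCCL documentation; the surrounding script is illustrative):

```python
import os

# NCCL reads these environment variables when the communicator is
# initialized, so they must be set before e.g.
# torch.distributed.init_process_group("nccl") is called.
os.environ["NCCL_ALGO"] = "Ring"   # force the Ring algorithm
os.environ["NCCL_PROTO"] = "LL"    # force the LL (low-latency) protocol

print(os.environ["NCCL_ALGO"], os.environ["NCCL_PROTO"])
```

Setting them after the process group is initialized has no effect, which is a common source of confusion when the chosen algorithm does not appear in the `NCCL_DEBUG=INFO` output.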
-
### 🐛 Describe the bug
I hit an error when using torchrun for 4-GPU training with the 'nccl' backend (it runs perfectly with 'gloo'). The environment is Python 3.9 + PyTorch 2.3.0 + CUDA 12.1. We tried to us…
-
Hello,
I followed the [official AWS OFI NCCL plugin installation guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html#nccl-start-base-plugin), but I found that there is a p…