-
Not sure if this is intended or avoidable, but if `dst` is inconsistent across ranks, `reduce` finishes, yet future kernels seem to hang. E.g.,
```py
import torch
torch.distributed.init_process_group('ncc…
```
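For reference, a minimal repro sketch of the mismatch (my own sketch, not the truncated original; two ranks launched with `torchrun`, tensor size arbitrary):
```py
import torch
import torch.distributed as dist

# Sketch: each rank passes a *different* `dst` to dist.reduce, violating
# the requirement that collective arguments match across ranks.
dist.init_process_group('nccl')
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.ones(4, device='cuda')
dist.reduce(t, dst=rank)   # dst=0 on rank 0, dst=1 on rank 1: mismatch

t += 1                     # subsequent CUDA work
dist.all_reduce(t)         # reported symptom: a later collective hangs
torch.cuda.synchronize()
```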
-
Hello,
I am curious about the use of GPUDirect RDMA across physical machines. My setup involves two physical machines, each equipped with one GPU and an RDMA-capable network card, which is not on…
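For context, a common way to verify whether GPUDirect RDMA is actually used is to run a small collective with NCCL debug logging and look for `GDRDMA` in the transport lines. A minimal sketch, assuming two single-GPU nodes and environment-variable initialization (addresses and port are placeholders):
```py
# Run on each machine with, e.g.:
#   MASTER_ADDR=<node0-ip> MASTER_PORT=29500 RANK=<0|1> WORLD_SIZE=2 \
#   NCCL_DEBUG=INFO python check_gdr.py
# With GPUDirect RDMA active, the INFO log typically shows a transport
# such as "via NET/IB/0/GDRDMA"; without it, plain "via NET/IB/0".
import torch
import torch.distributed as dist

dist.init_process_group('nccl')   # reads RANK/WORLD_SIZE/MASTER_* from env
torch.cuda.set_device(0)          # one GPU per machine in this setup

x = torch.ones(1 << 20, device='cuda')
dist.all_reduce(x)                # forces NCCL to choose a network transport
torch.cuda.synchronize()
print('all_reduce ok on rank', dist.get_rank())
dist.destroy_process_group()
```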
-
I am attempting to reproduce your results and was wondering if you have hit the following issue before. I am using a Lambda Labs 8xH100 SXM5 instance and run the following commands from a fresh instanc…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTor…
```
-
Hi, sjeaugey.
The performance of my test using NCCL + mpirun is not as expected. My environment is as follows: Node 1: H800*8, InfiniBand 200Gbps*4; Node 2: H800*8, InfiniBand 400Gbps*4; NCCL: 2.18.3, CUDA: 12.2, O…
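Not part of the original report, but as a quick cross-check outside nccl-tests, an all_reduce bandwidth probe can be run directly from PyTorch. A rough sketch, assuming a launch such as `torchrun --nproc_per_node=8` on each node (message size and iteration counts are arbitrary):
```py
import time
import torch
import torch.distributed as dist

dist.init_process_group('nccl')
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

nbytes = 256 * 1024 * 1024
x = torch.empty(nbytes // 4, dtype=torch.float32, device='cuda')

for _ in range(5):          # warm-up, excludes connection setup cost
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

# Ring all_reduce bus bandwidth: 2*(n-1)/n * bytes / time (as in nccl-tests).
busbw = 2 * (world - 1) / world * nbytes / dt / 1e9
if rank == 0:
    print(f'avg {dt * 1e3:.2f} ms, busBW {busbw:.1f} GB/s')
```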
-
The code is as follows:
```
size_t bytePerChannel[/*collNetSupport*/2];
if (comm->channelSize > 0) {
  // Set by user
  bytePerChannel[/*collNetSupport=*/0] = comm->channelSize;
  byte…
```
-
Hi, the command I use is `NCCL_DEBUG=INFO mpirun -np 2 -hosts master,worker ./build/all_reduce_perf -b 4M -e 4M -f 2 -g 8`
It works when I run the MPI example, so I think the problem is related to my op…
-
### Your current environment
```text
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version:…
```
-
Hi,
I have a CV training flow with sync batch normalization. However, the NCCL allgather in the **first** sync batch normalization of a forward pass is extremely long (>10s), while the NCCL all gath…
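One common cause of a slow *first* collective is NCCL's lazy communicator and connection setup, which is paid by whichever call runs first. A minimal warm-up sketch (a general workaround, not a confirmed diagnosis of this report):
```py
import torch
import torch.distributed as dist

dist.init_process_group('nccl')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Issue one tiny collective up front so communicator setup happens here
# rather than inside the first SyncBatchNorm all_gather of the forward pass.
warm = torch.zeros(1, device='cuda')
dist.all_reduce(warm)
torch.cuda.synchronize()
# ... build the model, convert BN layers with
# torch.nn.SyncBatchNorm.convert_sync_batchnorm, wrap in DDP, train ...
```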
-
I am currently using Horovod for model training; the underlying gradient synchronization uses NCCL. Slow (straggler) nodes appear during the training process. Is there any…
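Not in the original post, but Horovod's built-in timeline is one way to identify stragglers: long `NEGOTIATE_ALLREDUCE` phases on a rank mean it reaches the collective late. A minimal sketch, assuming Horovod with PyTorch support:
```py
# Launch with, e.g.:
#   HOROVOD_TIMELINE=/tmp/timeline.json horovodrun -np 8 python probe.py
# then open /tmp/timeline.json in chrome://tracing.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

x = torch.ones(1024, device='cuda')
y = hvd.allreduce(x, name='probe')   # named ops are easy to find in the timeline
print(f'rank {hvd.rank()} allreduce done')
```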