-
### System Info
8*A100 in a Docker environment
### Information
- [x] The official example scripts
- [ ] My own modified scripts
### 🐛 Describe the bug
Training always aborts after saving the checkp…
-
When I use Megatron.core to train an MoE model, I get the following error:
**Output Info :**
[rank2]:[E ProcessGroupNCCL.cpp:754] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(Seq…
-
### 🐛 Describe the bug
Using `torch.distributed.barrier()` doesn't work with NCCL. I use the following code:
```python
import torch
def train() -> None:
torch.distributed.init_process_group('nccl'…
-
**Describe the bug**
```
99%|█████████▉| 23054/23316 [2:22:47
-
Hi, I have observed that although I passed `NCCL_SHM_DISABLE: 1`, it still tries to access `/dev/shm` and fails with an error. Is this behaviour expected, or is it a bug? Below I have attached the log fo…
-
Hello
We are currently supporting nccl-test for a client company.
We run it with the script below.
mpirun -np 300 -N 1 -x NCCL_DEBUG=INFO --hostfile /nccl/hostfile \
-mca plm_rsh_no_…
-
Hi, I got `socketProgress: Connection closed by remote peer` when executing ncclAllToAll via ncclSend and ncclRecv.
I noticed that if a NCCL_SOCKET_RECV operation receives **zero** bytes, the socket is closed:
```
if (op =…
-
Intra-node collective communication works well via NCCL (H100 HGX server with NVSwitch), but for inter-node communication we encountered the InfiniBand device error below (GPU Direct RD…
-
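### Problem description
I tested with a fixed seed. To confirm that the seed was really fixed, I first ran the multi-GPU script repeatedly and made sure the image was identical on every run.
Under this condition, these are the images generated with different numbers of GPUs: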
| | image |
|--------------------------------|-------|
| flux_result_dp1_cfg1_ulysses1_…
-
### 🐛 Describe the bug
The `unordered` pg destroy test introduced in https://github.com/pytorch/pytorch/pull/119045 seems to no longer be supported in recent versions of NCCL. When checking with the …