nccl Search Results - Githubissues

1000+ results
for nccl


X-LANCE/SLAM-LLM #109

NCCL error when saving with DDP

### System Info 8*A100 with docker environment ### Information - [x] The official example scripts - [ ] My own modified scripts ### 🐛 Describe the bug Training always aborts after saving the checkp…

Vindicator645 updated 2 months ago
2
NVIDIA/Megatron-LM #735

[BUG] NCCL TIMEOUT ( maybe ALLREDUCE ? )

When I use Megatron.core to train an MoE model, I get the following error: **Output Info:** [rank2]:[E ProcessGroupNCCL.cpp:754] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(Seq…

ZhangEnmao updated 1 month ago
7
pytorch/pytorch #129749

nccl deadlock

### 🐛 Describe the bug Using `torch.distributed.barrier()` doesn't work with NCCL. I use the code ```python import torch def train() -> None: torch.distributed.init_process_group('nccl'…

nikonikolov updated 4 months ago
6
microsoft/DeepSpeed #6705

[BUG] NCCL Timeout When Pre-training "ds_train_bert_nvidia_dat…

**Describe the bug** ``` 99%|█████████▉| 23054/23316 [2:22:47

always-H updated 3 days ago
2
NVIDIA/nccl #1466

[BUG]: NCCL_SHM_DISABLE flag is not working

Hi, I have observed that although I passed `NCCL_SHM_DISABLE: 1`, it still tries to access `/dev/shm` and gives an error. Is this behaviour expected, or is it a bug? Below I have attached the log fo…

priyanshu891 updated 1 month ago
1
NVIDIA/nccl #1454

300node 8GPU 4 IB NCCL TEST

Hello. Currently, our client company is supporting nccl-test. We are supporting it by writing the script below. mpirun -np 300 -N 1 -x NCCL_DEBUG=INFO --hostfile /nccl/hostfile \ -mca plm_rsh_no_…

gim4moon updated 1 month ago
4
NVIDIA/nccl #1486

NCCL WARN socketProgress: Connection closed by remote peer

Hi, I got `socketProgress: Connection closed by remote peer` when executing ncclAllToAll via ncclSend & ncclRecv. I noticed that if NCCL_SOCKET_RECV receives **zero** bytes, it will close the socket: ``` if (op =…

ganyu1992 updated 3 weeks ago
2
NVIDIA/nccl #1467

Enroot

Intra-node collective communication works well via NCCL (H100 HGX server with NVSwitch), but we encountered the InfiniBand device error below for inter-node communication (GPU Direct RD…

dobiup updated 1 month ago
1
xdit-project/xDiT #262

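Image differences when generating with FLUX under SP parallelism

### Problem description I tested with a fixed seed. To confirm the seed was really fixed, I first ran the multi-GPU script repeatedly and made sure the image was the same every time. Under this condition, the images generated with different numbers of GPUs: | | image | |--------------------------------|-------| | flux_result_dp1_cfg1_ulysses1_…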

lixiang007666 updated 6 days ago
7
pytorch/pytorch #137507

[NCCL] Unordered destruction of `ProcessGroupNCCL` no longer…

### 🐛 Describe the bug The `unordered` pg destroy test introduced in https://github.com/pytorch/pytorch/pull/119045 seems to no longer be supported in recent versions of NCCL. When checking with the …

eqy updated 4 weeks ago
3

