-
### 🚀 The feature, motivation and pitch
While looking at sources of performance variability for multi-node training jobs, we have found that one mechanism is associated with activity of the pytorch N…
-
Hi NCCL experts,
I was recently investigating some of the output of `NCCL_DEBUG_SUBSYS=Graph` with the latest NCCL version (NCCL 2.23).
I was specifically looking at some of the output that `ncclTop…
-
Hi, I seriously need your help.
First, the NVIDIA HW/SW components for intra-node and inter-node communication are compatible and enabled, and working well.
NVSwitch (NVLink, Fabric Manager), GPU (…
-
When using rccl rdma sharp plugin, I encountered a program crash with the following log:
```
[root@node01 ~]# mpirun \
> -np 2 \
> --oversubscribe \
> --allow-run-as-root \
> -H n…
-
### What happened + What you expected to happen
```
from time import perf_counter
from time import sleep
from contextlib import contextmanager
from typing import Callable
STATIC_SHAPE = False
…
-
### 🐛 Describe the bug
```python
from torch.distributed._tensor import Replicate, Shard, distribute_tensor, init_device_mesh
import torch
from torch import distributed as dist
if __name__ == …
-
I tried running `examples/torch_ddp_benchmark` on Kubernetes, but the task hangs with the following error until an NCCL timeout is thrown. It might be related to this [issue](https://github.com/pyt…
-
I consulted the NCCL documentation and found that by using NCCL_ALGO and NCCL_PROTO, I can specify the algorithm and protocol used when running NCCL. For example, -x NCCL_ALGO=Ring -x NCCL_PROTO=LL in…
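Besides passing `-x NCCL_ALGO=Ring -x NCCL_PROTO=LL` through `mpirun`, the same selection can be made in-process. A minimal sketch, assuming a PyTorch launcher script where the variables are exported before the NCCL communicator is created (the variable names come from the NCCL documentation; the surrounding script is illustrative):

```python
import os

# NCCL reads these environment variables when the communicator is
# initialized, so they must be set before e.g.
# torch.distributed.init_process_group("nccl") is called.
os.environ["NCCL_ALGO"] = "Ring"   # force the Ring algorithm
os.environ["NCCL_PROTO"] = "LL"    # force the LL (low-latency) protocol

print(os.environ["NCCL_ALGO"], os.environ["NCCL_PROTO"])
```

Setting them after the process group is initialized has no effect, which is a common source of confusion when the chosen algorithm does not appear in the `NCCL_DEBUG=INFO` output.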
-
### 🐛 Describe the bug
I hit an error when using torchrun for 4-GPU training with the 'nccl' backend (it runs perfectly with 'gloo'). The environment is Python 3.9 + PyTorch 2.3.0 + CUDA 12.1. We tried to us…
-
Hello,
I followed the [official AWS OFI NCCL plugin installation guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html#nccl-start-base-plugin), but I found that there is a p…