-
The following error occurred when I used OpenMPI to run nccl-tests across multiple machines (two nodes):
![image](https://github.com/NVIDIA/nccl-tests/assets/79404809/0a378553-9b5c-4cce-86d7-777400193a…
-
In our SLURM cluster we use dual-attached servers connected by L3 BGP unnumbered (with FRR "BGP to the host") via the lan0 and lan1 interfaces (ECMP; see the routing table, which is really simple, as everythi…
-
While running the multi-GPU PyTorch tests, test_all_reduce_coalesced_nccl fails in pytorch/test/test_c10d_nccl.py. The error appears to come from inconsistent results from allreduce…
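For context, the invariant that a coalesced all-reduce test checks can be sketched in plain Python (a simulation of the collective's semantics, not NCCL or the PyTorch test itself): every rank contributes a list of tensors, and afterwards every rank must hold the element-wise sum of each tensor across all ranks.

```python
# Plain-Python sketch of coalesced all-reduce semantics (a simulation, not
# NCCL): per_rank_tensors[r][t] is tensor t on rank r, modeled as a list of
# numbers. After the collective, every rank holds the same element-wise sums.

def allreduce_coalesced(per_rank_tensors):
    """Return the tensor list every rank should observe afterwards:
    result[t] is the element-wise sum of tensor t across all ranks."""
    num_tensors = len(per_rank_tensors[0])
    reduced = []
    for t in range(num_tensors):
        # Element-wise sum of tensor t across all ranks.
        summed = [sum(vals) for vals in zip(*(rank[t] for rank in per_rank_tensors))]
        reduced.append(summed)
    return reduced

if __name__ == "__main__":
    ranks = [
        [[1, 2], [10]],   # rank 0 contributes two tensors
        [[3, 4], [20]],   # rank 1 contributes two tensors
    ]
    print(allreduce_coalesced(ranks))  # [[4, 6], [30]]
```

"Inconsistent results" in the report means different ranks ended up with different values after the collective, violating this invariant.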
-
Hi! I wanted to surface a quirk of running jax on an AWS GPU cluster in case it's helpful — jax's vendored NCCL doesn't play well with AWS Libfabric EFA due to the way that jax starts processes (issue…
-
### What happened + What you expected to happen
I'm running the online DPO code on multiple nodes in MegatronLM. There are three nodes in total. Among them, four cards are allocated for the actor mod…
-
Hi 👋 ,
When trying to run any NCCL application, it always seems to hang when running on more than 2 GPUs (see attached logs with `NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=ALL`).
The command is ex…
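For reference, the debug settings quoted above (`NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`) are real NCCL environment variables and must be present in every rank's environment before NCCL initializes. A minimal Python launcher sketch (the `mpirun` invocation is a hypothetical placeholder):

```python
import os
import subprocess

# Sketch: export NCCL debug settings into the environment that every rank
# inherits. NCCL reads these at initialization time, so setting them after
# the first collective has no effect.
env = dict(os.environ)
env.update({
    "NCCL_DEBUG": "TRACE",        # maximum verbosity
    "NCCL_DEBUG_SUBSYS": "ALL",   # log all subsystems (INIT, NET, GRAPH, ...)
})

# Hypothetical launch command; substitute the actual one from the report:
# subprocess.run(["mpirun", "-np", "4", "./my_nccl_app"], env=env, check=True)

print(env["NCCL_DEBUG"], env["NCCL_DEBUG_SUBSYS"])  # TRACE ALL
```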
-
We have observed a significant performance degradation in the alltoall operation when using NCCL versions 2.19 and 2.20 compared to version 2.18.
**System Configuration:**
Max Nodes: 8
Machine Typ…
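For readers unfamiliar with the operation whose throughput regressed: alltoall is a full exchange in which rank i's input is split into n chunks and chunk j is delivered to rank j. A plain-Python simulation of the semantics (not NCCL itself):

```python
# Plain-Python simulation of alltoall semantics (not NCCL): with n ranks,
# inputs[i][j] is the chunk rank i sends to rank j, and afterwards
# outputs[j][i] == inputs[i][j] -- effectively a transpose of the chunk grid.

def alltoall(inputs):
    """inputs[i][j] = chunk rank i sends to rank j.
    Returns outputs with outputs[j][i] = inputs[i][j]."""
    n = len(inputs)
    return [[inputs[i][j] for i in range(n)] for j in range(n)]

if __name__ == "__main__":
    inp = [["a0", "a1"], ["b0", "b1"]]  # 2 ranks, 2 chunks each
    print(alltoall(inp))  # [['a0', 'b0'], ['a1', 'b1']]
```

Because every rank exchanges data with every other rank, alltoall is especially sensitive to per-peer transport and channel tuning, which is why a version-to-version regression shows up so clearly here.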
-
When I use the dino config to run a test with PyTorch 1.13 + mmcv 2.0.0, I get this error:
-
[0] NCCL INFO cudaDriverVersion 11040
[0] NCCL INFO Bootstrap : Using eth0:10.84.253.70
[0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
File "/usr/loc…
-
Hi.
I have been running nccl-tests in a multi-node, multi-GPU environment with NCCL 2.19.3-1 and OpenMPI 4.1.6. Each node has 4 NVIDIA V100 GPUs interconnected with NVLink and PCIe.
1. How is th…