-
On the LANL Venado machine (Linux ARM/Grace-Hopper architecture), whether using the clang 18 (`Cray clang version 18.0.0`) or the gcc-13 (`13.2.1`) compiler toolchain (both with `nvcc` from CUDA 12.5), the sam…
-
I am running distributed training on 2 nodes with 8 H100 GPUs, using the llama-factory repo.
scripts:
Node 1:
export NCCL_IB_GID_INDEX=4
export NCCL_IB_HCA=mlx5
export NCCL_IB_DISABLE=0
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBS…
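The Node 2 exports and the actual launch command are truncated above, so the following is only a sketch: assuming both nodes use the same NCCL_* exports and the job is launched with torchrun (hypothetical script name `nccl_check.py`), a single all_reduce exercises the same NCCL path the training job uses and, with `NCCL_DEBUG=INFO`, prints which NICs and rings NCCL selects.

```python
# Minimal NCCL sanity check (hypothetical nccl_check.py), assuming the same
# NCCL_* exports on both nodes and a torchrun launch such as:
#   torchrun --nnodes=2 --nproc_per_node=<gpus-per-node> --node_rank=<0|1> \
#            --master_addr=<node1-ip> --master_port=29500 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every process it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # One all_reduce exercises the same NCCL path the training job uses;
    # with NCCL_DEBUG=INFO the selected NICs and rings are printed.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```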
-
![Image](https://github.com/user-attachments/assets/d5caaf41-6fb1-4463-b4f7-f188bd3f8664)
As shown in this graph, NET/1 and GPU(0) are in the same NUMA node, NET/0 and GPU(1) are in the same NUMA node, and all are under the same RC,…
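To confirm the GPU-to-NIC placement the graph shows, a small sysfs check like the one below (an illustrative sketch, assuming Linux, `nvidia-smi` on the PATH, and `mlx5_*` HCAs) prints the NUMA node of every GPU and InfiniBand device so the NET/GPU pairing can be verified outside of NCCL.

```python
# Quick NUMA-affinity check for GPUs and mlx5 HCAs (Linux sysfs; illustrative only).
import glob
import os
import subprocess

def pci_numa(busid: str) -> str:
    # nvidia-smi prints an 8-digit PCI domain (e.g. 00000000:0F:00.0); sysfs uses 4 digits.
    dom, rest = busid.split(":", 1)
    path = f"/sys/bus/pci/devices/{dom[-4:].lower()}:{rest.lower()}/numa_node"
    return open(path).read().strip()

# GPUs: index and PCI bus id from nvidia-smi.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"], text=True)
for line in out.strip().splitlines():
    idx, busid = [f.strip() for f in line.split(",")]
    print(f"GPU({idx}) busid={busid} numa={pci_numa(busid)}")

# mlx5 HCAs: NUMA node straight from the InfiniBand class device.
for dev in sorted(glob.glob("/sys/class/infiniband/mlx5_*")):
    numa = open(os.path.join(dev, "device", "numa_node")).read().strip()
    print(f"{os.path.basename(dev)} numa={numa}")
```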
-
### Package Details
* Package Name/Version: **nccl/2.8.4**
* Website: **https://developer.nvidia.com/nccl**
* Source code: **https://github.com/NVIDIA/nccl**
### Description Of The Libra…
-
Hi, I'm having this issue: `Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078...) ran for 600026 milliseconds before timing out`
The code I'm running is a VQGAN training script. Par…
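The ~600,000 ms in the message matches the usual ProcessGroupNCCL watchdog timeout, so one stop-gap (a sketch, not a fix for whatever hang or rank desync is actually occurring) is to pass a larger `timeout` when initializing the process group:

```python
# Raising the NCCL watchdog timeout (the ~600026 ms in the error corresponds to
# the common 10-minute default). This only buys time; the underlying hang or
# rank desync still needs to be diagnosed, e.g. with NCCL_DEBUG=INFO.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=60),  # instead of the 10-minute default
)
```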
-
### Version
24.07
### On which installation method(s) does this occur?
Docker
### Describe the issue
When I run train_graphcast on a single machine with multiple GPUs, I encounter the following error message…
-
1. Remove the words "YES" and "NO" from product titles because of the sick evaluation process! or using
> `return logits[:, 1][-1:], gold[-1:]`
in function preprocess_logits_for_metrics…
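If this refers to the Hugging Face Trainer hook of the same name (an assumption on my part), the quoted return line would sit inside a function passed to the Trainer as `preprocess_logits_for_metrics=...`; a hypothetical sketch:

```python
# Hypothetical placement of the quoted return line inside the HF Trainer hook;
# the slicing mirrors the snippet above and keeps only the last position.
def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):  # some models return (lm_logits, past, ...)
        logits = logits[0]
    gold = labels
    return logits[:, 1][-1:], gold[-1:]

# trainer = Trainer(..., preprocess_logits_for_metrics=preprocess_logits_for_metrics)
```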
-
We have 2 H200 servers connected through an IP switch. We ran nccl-tests on the bare-metal systems, and the all_reduce_perf script worked well with the expected performance.
```
fs@fs-207:~$ mpirun -np 16 -H 20…
```
-
Hi,
When I was using GPUs for AI inference, I found that communication takes much more time (higher latency) compared to the open-source NCCL code, even with the same H/W & S/W configuration on different serv…
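To put a number on the latency gap, a rough probe like the following (an illustrative sketch, launched with torchrun on each server being compared) times a fixed-size all_reduce after warm-up:

```python
# Rough all_reduce latency probe (illustrative): run the same script on the
# servers being compared and compare the reported averages.
import os
import time
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    x = torch.ones(1 << 20, device="cuda")  # ~4 MB payload
    for _ in range(10):                     # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 100
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        avg_us = (time.perf_counter() - t0) / iters * 1e6
        print(f"avg all_reduce time: {avg_us:.1f} us")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```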
-
Deployed with the v0.12.0 Docker image; the startup command is as follows:
sudo docker run -d -v /home/tskj/MOD/:/home/MOD/ -e XINFERENCE_HOME=/home/MOD -p 9997:9997 --gpus all xprobe/xinference:v0.12.0 xinference-local -H 0.0.0.0 --log-level de…
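The command above is truncated, but once the container is up, a quick request against the mapped port 9997 (assuming Xinference's documented RESTful `/v1/models` route) confirms the server is reachable before loading any models:

```python
# Quick reachability check against the Xinference endpoint mapped to port 9997
# (assumes the standard RESTful /v1/models route is exposed).
import requests

resp = requests.get("http://localhost:9997/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())
```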