-
On the LANL Venado machine (Linux ARM/Grace-Hopper architecture), whether using the clang 18 (`Cray clang version 18.0.0`) or the gcc-13 (`13.2.1`) compiler toolchain (both with `nvcc` from CUDA 12.5), the sam…
-
I am running distributed training on 2 nodes with 8 H100 GPUs, using the llama-factory repo.
scripts:
Node 1:
export NCCL_IB_GID_INDEX=4
export NCCL_IB_HCA=mlx5
export NCCL_IB_DISABLE=0
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBS…
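The Node 2 exports and the actual launch command are truncated above, so the following is only a sketch: assuming both nodes use the same NCCL_* exports and the job is launched with torchrun (hypothetical script name `nccl_check.py`), a single all_reduce exercises the same NCCL path the training job uses and, with `NCCL_DEBUG=INFO`, prints which NICs and rings NCCL selects.

```python
# Minimal NCCL sanity check (hypothetical nccl_check.py), assuming the same
# NCCL_* exports on both nodes and a torchrun launch such as:
#   torchrun --nnodes=2 --nproc_per_node=<gpus-per-node> --node_rank=<0|1> \
#            --master_addr=<node1-ip> --master_port=29500 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every process it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # One all_reduce exercises the same NCCL path the training job uses;
    # with NCCL_DEBUG=INFO the selected NICs and rings are printed.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```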
-
![Image](https://github.com/user-attachments/assets/d5caaf41-6fb1-4463-b4f7-f188bd3f8664)
As shown in this graph, NET/1 and GPU(0) are in the same NUMA node, NET/0 and GPU(1) are in the same NUMA node, and all are under the same RC,…
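To confirm the GPU-to-NIC placement the graph shows, a small sysfs check like the one below (an illustrative sketch, assuming Linux, `nvidia-smi` on the PATH, and `mlx5_*` HCAs) prints the NUMA node of every GPU and InfiniBand device so the NET/GPU pairing can be verified outside of NCCL.

```python
# Quick NUMA-affinity check for GPUs and mlx5 HCAs (Linux sysfs; illustrative only).
import glob
import os
import subprocess

def pci_numa(busid: str) -> str:
    # nvidia-smi prints an 8-digit PCI domain (e.g. 00000000:0F:00.0); sysfs uses 4 digits.
    dom, rest = busid.split(":", 1)
    path = f"/sys/bus/pci/devices/{dom[-4:].lower()}:{rest.lower()}/numa_node"
    return open(path).read().strip()

# GPUs: index and PCI bus id from nvidia-smi.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"], text=True)
for line in out.strip().splitlines():
    idx, busid = [f.strip() for f in line.split(",")]
    print(f"GPU({idx}) busid={busid} numa={pci_numa(busid)}")

# mlx5 HCAs: NUMA node straight from the InfiniBand class device.
for dev in sorted(glob.glob("/sys/class/infiniband/mlx5_*")):
    numa = open(os.path.join(dev, "device", "numa_node")).read().strip()
    print(f"{os.path.basename(dev)} numa={numa}")
```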
-
### Package Details
* Package Name/Version: **nccl/2.8.4**
* Website: **https://developer.nvidia.com/nccl**
* Source code: **https://github.com/NVIDIA/nccl**
### Description Of The Libra…
-
Hi, I'm having this issue: `Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80078...) ran for 600026 milliseconds before timing out`
The code I'm running is a VQGAN training script. Par…
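The ~600,000 ms in the message matches the usual ProcessGroupNCCL watchdog timeout, so one stop-gap (a sketch, not a fix for whatever hang or rank desync is actually occurring) is to pass a larger `timeout` when initializing the process group:

```python
# Raising the NCCL watchdog timeout (the ~600026 ms in the error corresponds to
# the common 10-minute default). This only buys time; the underlying hang or
# rank desync still needs to be diagnosed, e.g. with NCCL_DEBUG=INFO.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=60),  # instead of the 10-minute default
)
```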
-
### Version
24.07
### On which installation method(s) does this occur?
Docker
### Describe the issue
When I run train_graphcast on a single machine with multiple GPUs, I encounter the following error message…
-
1. Remove the words "YES" and "NO" from product titles because of the sick evaluation process! or using
> `return logits[:, 1][-1:], gold[-1:]`
in function preprocess_logits_for_metrics…
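If this refers to the Hugging Face Trainer hook of the same name (an assumption on my part), the quoted return line would sit inside a function passed to the Trainer as `preprocess_logits_for_metrics=...`; a hypothetical sketch:

```python
# Hypothetical placement of the quoted return line inside the HF Trainer hook;
# the slicing mirrors the snippet above and keeps only the last position.
def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):  # some models return (lm_logits, past, ...)
        logits = logits[0]
    gold = labels
    return logits[:, 1][-1:], gold[-1:]

# trainer = Trainer(..., preprocess_logits_for_metrics=preprocess_logits_for_metrics)
```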
-
We have 2 H200 servers connected through an IP switch. We ran nccl-tests on the bare-metal systems, and the all_reduce_perf script worked well with the expected performance.
```
fs@fs-207:~$ mpirun -np 16 -H 20…
```
-
Hi,
When I was using GPUs for AI inference, I found that communication takes much more time (higher latency) compared to the open-source NCCL code, even with the same H/W & S/W configuration on different serv…
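To put a number on the latency gap, a rough probe like the following (an illustrative sketch, launched with torchrun on each server being compared) times a fixed-size all_reduce after warm-up:

```python
# Rough all_reduce latency probe (illustrative): run the same script on the
# servers being compared and compare the reported averages.
import os
import time
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    x = torch.ones(1 << 20, device="cuda")  # ~4 MB payload
    for _ in range(10):                     # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 100
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        avg_us = (time.perf_counter() - t0) / iters * 1e6
        print(f"avg all_reduce time: {avg_us:.1f} us")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```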
-
Deployed with the v0.12.0 Docker image; the startup command is as follows:
sudo docker run -d -v /home/tskj/MOD/:/home/MOD/ -e XINFERENCE_HOME=/home/MOD -p 9997:9997 --gpus all xprobe/xinference:v0.12.0 xinference-local -H 0.0.0.0 --log-level de…
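The command above is truncated, but once the container is up, a quick request against the mapped port 9997 (assuming Xinference's documented RESTful `/v1/models` route) confirms the server is reachable before loading any models:

```python
# Quick reachability check against the Xinference endpoint mapped to port 9997
# (assumes the standard RESTful /v1/models route is exposed).
import requests

resp = requests.get("http://localhost:9997/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())
```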