-
When running the eval.py script with "--use_dist True", I am facing this error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, u…
-
Similar to NCCL tests for Kubernetes https://github.com/aws-samples/awsome-distributed-training/tree/main/micro-benchmarks/nccl-tests/kubernetes - it would be great if there was a similar test for NCC…
-
We are seeing an issue with NCCL allreduce performance that we would appreciate Nvidia's help on.
We have three nodes split across two racks: Two nodes on one rack and one node on another rack.
Two-…
-
### Your current environment
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (U…
-
Would like to install NCCL as a dedicated module which can be linked into PyTorch / TensorFlow / other programs that want to use an optimized internode collective / peer communication.
Repos to refe…
dphow updated
3 months ago
-
**Describe the bug**
What the bug is, and how to reproduce, better with screenshots
NCCL times out when model training reaches a fixed step.
![6512a21092e02368ce384707d830cf8b](https://github.com/user-attachments/…
-
When running the program on 4 GPUs, an error occurs at line 343 of train_multidatasets.py: execution gets stuck at the line `results = evaluator.evaluate()` in the `inference_on_dataset` function. The error message i…
-
I am sharing this error in the hope that you find it useful. Below is the traceback. Let me know if there's anything I can do to make it more verbose or any particular info you want about my envir…
-
### Describe the Bug
NGC Paddle will be updated to v3.0-beta2, and `test_collective_reduce_scatter_api.py` fails with `AttributeError: 'paddle.base.libpaddle.pir.Value' object has no attribute 'desc'`.
I used the official Paddle-provided docker…
-
### Problem Description
I want to use vLLM to start a Qwen model on an AMD GPU for testing. When I start it on one GPU, it starts and runs normally. The log after startup is as follows:
`
…