-
While looking at the contents of the [2.23.4 PyPI wheel](https://pypi.org/project/nvidia-nccl-cu12/2.23.4/), I noticed that the ext-net `nccl_net.h` header is now included in the package.
```
$ find venv…
```
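To double-check this without unpacking the wheel by hand, the installed distribution's file list can also be queried from Python. This is just a minimal sketch, assuming nvidia-nccl-cu12 2.23.4 is installed in the active environment:

```python
# Minimal sketch: list the header files shipped by the installed nvidia-nccl-cu12 wheel
# and check whether nccl_net.h is among them (assumes the package is installed).
from importlib.metadata import files

headers = [f for f in (files("nvidia-nccl-cu12") or []) if f.suffix == ".h"]
for h in headers:
    print(h)

print("nccl_net.h present:", any(h.name == "nccl_net.h" for h in headers))
```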
-
### Bug description
I am trying to run a very simple training script on 2 nodes, and I always get this error:
Output:
```
(ve) root@442a8ba5c0c6:~/ptl# . wr.sh
Start fitting...
Initializing…
```
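For context, a minimal 2-node Lightning run of the kind described here looks roughly like the sketch below. The actual script and `wr.sh` are not shown in the excerpt, so the Trainer arguments and the assumption that the launcher exports MASTER_ADDR, MASTER_PORT and NODE_RANK are mine, not the reporter's:

```python
# Hedged sketch of a minimal 2-node PyTorch Lightning job (not the reporter's actual script).
# Assumes the launcher (e.g. wr.sh) exports MASTER_ADDR, MASTER_PORT and NODE_RANK per node.
import torch
from torch.utils.data import DataLoader, TensorDataset

import lightning as L  # older versions: import pytorch_lightning as L


class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)
    trainer = L.Trainer(
        accelerator="gpu",
        devices=1,        # GPUs per node (assumption)
        num_nodes=2,      # two participating nodes
        strategy="ddp",   # DDP uses the NCCL backend on GPUs
        max_epochs=1,
    )
    print("Start fitting...")
    trainer.fit(ToyModel(), data)
```

With this layout the same script is started once on each node, and the NCCL process group is formed during `trainer.fit`.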
-
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[r…
-
### What happened + What you expected to happen
[Microbenchmark](https://github.com/ray-project/ray/blob/master/python/ray/_private/ray_experimental_perf.py#L150) results for a single-actor acceler…
-
I have a setup of 2 nodes with KubeVirt VMs, running with 2 GPUs:
` mpirun --allow-run-as-root --show-progress -H 10.194.9.3,10.194.10.5 -map-by node -np 2 -x PATH -x NCCL_IB_GID_INDEX=3 -x NCCL_D…
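Before debugging the full training job, it can help to reduce this to a two-rank all_reduce sanity check launched with the same `mpirun` command. The sketch below is an assumption on my part (it reads Open MPI's rank variables and uses the first host from the `-H` list as the rendezvous address), not part of the original report:

```python
# Hedged sketch: NCCL all_reduce sanity check, launched via the mpirun command above.
# Rank/size are taken from Open MPI's environment variables; adjust for other launchers.
import os

import torch
import torch.distributed as dist

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

os.environ.setdefault("MASTER_ADDR", "10.194.9.3")  # first host in the -H list
os.environ.setdefault("MASTER_PORT", "29500")

torch.cuda.set_device(local_rank)                   # pin each rank to its own GPU
dist.init_process_group("nccl", rank=rank, world_size=world_size)

t = torch.ones(1, device="cuda")
dist.all_reduce(t)                                  # expect the result to equal world_size
print(f"rank {rank}: all_reduce -> {t.item()}")
dist.destroy_process_group()
```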
-
**Please describe the bug**
**Please describe the expected behavior**
**System information and environment**
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu 18.04…
-
I ran into this situation when training AllSpark on 2 RTX 3090s. I have tried many things, such as increasing the 'timeout' of init_process_group, increasing NCCL_BUFFSIZE, and setting NCCL_P2P_LEVEL=NVL. But all of th…
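For reference, the combination of mitigations described above (a larger init_process_group timeout plus NCCL environment variables) typically looks like the sketch below; the values are illustrative placeholders, not settings known to resolve this particular hang:

```python
# Hedged sketch of the mitigations mentioned above; values are illustrative only.
# Assumes the usual env:// rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK,
# WORLD_SIZE) are provided by the launcher.
import datetime
import os

import torch.distributed as dist

os.environ["NCCL_BUFFSIZE"] = str(8 * 1024 * 1024)  # bytes; the NCCL default is 4 MiB
os.environ["NCCL_P2P_LEVEL"] = "NVL"                 # only use P2P over NVLink
os.environ.setdefault("NCCL_DEBUG", "INFO")          # log the transport NCCL selects

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=60),          # larger than the 30-minute default
)
```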
-
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
-
See the full log below. It handles the first few requests and then gets stuck:
```
2024-02-03 09:54:56,181 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-03 09:54:57 llm_engine.py:70…
```
-
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554786529/work/torch/lib/c10d/ProcessGroupNCCL.cpp:33, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA func…
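When NCCL reports an unhandled CUDA error like this, a reasonable first step is to confirm that each worker can actually see its GPUs and that the CUDA/NCCL builds PyTorch reports are the expected ones. A minimal hedged check, using only standard PyTorch introspection calls:

```python
# Hedged sanity-check sketch for "ncclUnhandledCudaError": verify basic CUDA visibility
# and the CUDA/NCCL versions PyTorch was built with before digging into NCCL itself.
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("visible devices:", torch.cuda.device_count())
print("built against CUDA:", torch.version.cuda)
print("NCCL version:", torch.cuda.nccl.version())
```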