-
Hi. Thanks for the amazing work. I am trying to run it in a Windows environment with Python 3.10, but I couldn't. I am getting this error: Collecting nvidia-nccl-cu12
Downloading nvidia-nccl-cu12-0.0.1.dev…
-
Thank you for your attention to this problem.
My workstation specs are:
RTX A4000 x2
WSL2_Ubuntu-22.04
cuDNN 8.9
(base) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ nvidia-smi
Fri Jun 28 05:15:17 2024
+------…
-
Hi,
I am getting the following error when I run `make build`:
```
enqueue.cc: In function 'ncclResult_t ncclEnqueueCheck(ncclInfo*)':
enqueue.cc:2025:25: error: expected ')' before 'PRIx64'
2025 | …
-
## Describe the Bug
After running a ResNet50 or TinyLlama2 workload on 4 ranks, the Kineto trace shows at least one nccl:broadcast collective. In the trace_link file the same colle…
-
This is from trying to update the Spack package to 2.6.2 and provide NCCL/RCCL support, but it doesn't look as if it's related to Spack. Building fails when I enable NCCL, but works without it; I'…
-
### System Info
```Shell
Accelerate 0.34.2
Numpy 1.26.4
(Singularity container based on Ubuntu 22.04)
```
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Ta…
-
I got the NCCL warning `WARN socketTryAccept: Accept failed: Bad file descriptor` during distributed training. I have tried both PyTorch and JAX; they show similar problems.
System info:
GPUs: 4090
CUDA: …
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.1.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
…
-
I can add an NCCL tests example, but before I do, it would be great to know whether that is something that would be accepted.
-
When I tried to train with two GPUs on a Linux server, I got an error:
return torch._C._dist_broadcast(tensor, src, group)
RuntimeErrorreturn torch._C._dist_broadcast(tensor, src, group):
NCCL error in: /opt/conda/conda-bld/py…
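
For NCCL errors like the one above, where the message is cut off or gives no cause, a common first step is to raise NCCL's log level through the `NCCL_DEBUG` environment variable before the process group is created. The following is a minimal sketch, not a fix for this specific report: the helper name `enable_nccl_debug` and its `env` parameter are hypothetical and exist only for illustration, and it assumes the variable is set before any NCCL communicator is initialized.

```python
import os


def enable_nccl_debug(level="INFO", env=None):
    """Set NCCL_DEBUG in a mapping and return the previous value.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    (Hypothetical helper, not part of NCCL or PyTorch.)
    """
    target = os.environ if env is None else env
    previous = target.get("NCCL_DEBUG")
    # NCCL reads this variable when it initializes, so it must be set
    # before the distributed backend creates its communicators.
    target["NCCL_DEBUG"] = level
    return previous
```

Calling `enable_nccl_debug()` at the top of the training script, before the distributed initialization, should make each rank print more detailed NCCL logs, which are usually what maintainers ask for in reports like these.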