-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
Hello NERSC Team,
HPCD CISL staff at NSF NCAR would like to request a performance comparison between your use of this nccl-ofi-plugin and ours, as copied below. We used a test suite that expands a bit on …
dphow updated 2 weeks ago
-
I have a setup with 2 nodes running KubeVirt VMs with 2 GPUs each,
` mpirun --allow-run-as-root --show-progress -H 10.194.9.3,10.194.10.5 -map-by node -np 2 -x PATH -x NCCL_IB_GID_INDEX=3 -x NCCL_D…
-
Hi,
I am using an AWS SageMaker ml.g5.48xlarge instance, which has 8 NVIDIA A10 GPUs. I have 4 scripts, each accessing 2 GPUs. I am using vLLM to load the Mixtral LLM onto the respective GPUs such as f…
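A minimal sketch of one way such a layout could be arranged (the helper, names, and grouping below are illustrative, not taken from the post): setting `CUDA_VISIBLE_DEVICES` before a process imports vLLM restricts that process to its own slice of the 8 GPUs.

```python
def gpu_groups(num_gpus: int, gpus_per_script: int) -> list:
    """Split GPU indices into contiguous groups, one group per script."""
    ids = list(range(num_gpus))
    return [
        ",".join(str(i) for i in ids[start:start + gpus_per_script])
        for start in range(0, num_gpus, gpus_per_script)
    ]

if __name__ == "__main__":
    # Hypothetical layout: 8 A10s shared by 4 scripts, 2 GPUs each.
    for script_idx, group in enumerate(gpu_groups(8, 2)):
        # Each script would export this before launching, e.g.:
        #   CUDA_VISIBLE_DEVICES=0,1 python script_0.py
        print(f"script {script_idx}: CUDA_VISIBLE_DEVICES={group}")
```

With this pinning, each vLLM instance sees only two devices, so a `tensor_parallel_size=2` model shard cannot collide with the other scripts' GPUs.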
-
### 🐛 Describe the bug
Async NCCL communications from `torch.distributed` should run in parallel with CUDA compute kernels, but traces from `torch.profiler` show this is not true for the first run. …
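For context, the `async_op=True` pattern the report refers to looks roughly like the sketch below. This is a single-process, CPU-only version using the gloo backend so it can run anywhere; the original issue concerns NCCL on GPU, where the question is whether the collective actually overlaps with CUDA kernels, which this sketch does not demonstrate.

```python
import os
import torch
import torch.distributed as dist

def async_allreduce_demo() -> float:
    """Illustrate torch.distributed's async_op=True API shape on CPU/gloo."""
    # Single-process process group purely for illustration.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29511")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    t = torch.ones(4)
    # async_op=True returns a work handle immediately instead of blocking.
    work = dist.all_reduce(t, op=dist.ReduceOp.SUM, async_op=True)
    # Independent "compute" that could proceed while the collective is in flight.
    busy = (torch.arange(4.0) * 2).sum().item()
    work.wait()  # block until the all_reduce has completed
    dist.destroy_process_group()
    return t.sum().item() + busy
```

With world_size 1 the all_reduce is a no-op sum, so `t` stays all ones (sum 4.0) and the overlapped computation contributes 12.0.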
-
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[r…
-
### 🐛 Describe the bug
torchrun multi-machine, multi-GPU training error.
Both Rank1 and Rank2 can train normally on their own.
The error occurs after the successful establishment of NCCL communicatio…
-
## error
When I use 2 GPUs to train a Flux LoRA, everything is fine and training succeeds, but when I use one GPU, or start with 2 GPUs and use only one, the error below appears; the code is the latest c…
-
Hi, developers.
I hit a hang while running nccl-tests. The details:
I have done all the steps following https://github.com/NVIDIA/nccl, but when I run "./build/all_reduce_perf -b 8 -e 1…
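When `all_reduce_perf` hangs, a common first diagnostic step (a generic sketch; the benchmark flag values here are examples, not the poster's truncated command) is to rerun with NCCL's debug logging enabled so the transport and ring setup is printed before the stall:

```shell
# NCCL_DEBUG=INFO prints initialization and transport-selection logs;
# NCCL_DEBUG_SUBSYS narrows the output to the init and network subsystems.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```

The last lines logged before the hang usually indicate whether the stall is in network setup (e.g. IB/socket selection) or in the collective itself.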
-
I have two servers, a Dell and a FusionServer. nccl-test does not work between them, but if all servers are the same model, nccl-test works.
My environment:
```
OS: Ubuntu 22.04
CUDA: 12.4
NVIDIA driver: 550
```
wh…
SdEnd updated 1 month ago