-
I have tried training the volumetric model on the CMU dataset, but am encountering more problems with training. The model is able to successfully train an epoch from the checkpoint of the previous epoch, …
-
I was debugging the following issue in PyTorch with regards to nccl send/recv: https://github.com/pytorch/pytorch/issues/50092. I tried to see if I could somehow reproduce the issue in NCCL itself to …
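Not part of the original report, but a minimal sketch of the same point-to-point pattern (`isend`/`irecv`) on the Gloo CPU backend, which is one way to check whether a hang is NCCL-specific. The port number and tensor shape are my own choices, and Gloo is a stand-in for the NCCL backend under discussion:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Hypothetical single-machine rendezvous; a real multi-node run would
    # set MASTER_ADDR/MASTER_PORT per the cluster (e.g. via torchrun).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        t = torch.arange(4, dtype=torch.float32)
        req = dist.isend(t, dst=1)   # non-blocking send to rank 1
    else:
        t = torch.zeros(4)
        req = dist.irecv(t, src=0)   # non-blocking receive from rank 0
    req.wait()                       # block until the transfer completes

    if rank == 1:
        assert t.tolist() == [0.0, 1.0, 2.0, 3.0]
    dist.destroy_process_group()

if __name__ == "__main__":
    # Spawn two ranks on one machine; raises if either rank fails or hangs.
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

Swapping `"gloo"` for `"nccl"` (with CUDA tensors) exercises the code path from the linked issue.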
-
Running distributed training on two AWS p4d.24xlarge instances and getting
```
[1,1]:  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 911, in irecv
[1,1]:R…
-
![image](https://user-images.githubusercontent.com/25666696/127292727-6670e5f0-7bd5-43b0-82f7-6d5178a04645.png)
-
I was running some benchmarks with torch-ucc using xccl for collectives, and I noticed very bad performance compared to NCCL. See numbers here: https://gist.github.com/froody/a86a5b2c5d9f46aedba7e930f…
-
## 🐛 Bug
This issue is related to #42107: [torch.distributed.launch: despite errors, training continues on some GPUs without printing any logs](https://github.com/pytorch/pytorch/issues/42107), whi…
-
I found that this MoE runs on DeepSpeed, but DeepSpeed has issues when running on a server without MPI. Any solution?
-
I am working on this project with an RTX A6000-48G and have run into some bugs.
My command is
`torchrun --nproc_per_node=4 main.py configs/training/train_resnet18_w2to6_a2to6.yaml`
nvidia-smi
```
+-----…
-
The example runs on the NCCL backend in a distributed GPU setting. I'm wondering whether it can profile correctly in a multi-node (multiple CPU servers) distributed CPU setting with the Gloo backend?
…
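Not from the original question, but a minimal sketch of the Gloo-on-CPU setup being asked about: a single-process group (rank 0, world size 1; the address and port are my own placeholders) running a collective under `torch.profiler`, the same wrapping one would use on the NCCL path:

```python
import os
import torch
import torch.distributed as dist
from torch.profiler import profile, ProfilerActivity

# Hypothetical single-process rendezvous; real multi-node CPU runs would set
# MASTER_ADDR/MASTER_PORT, rank, and world_size per node (e.g. via torchrun).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

x = torch.ones(4)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    dist.all_reduce(x)  # Gloo collective on CPU tensors

# With world_size=1, all_reduce sums over a single rank, so x is unchanged.
print(x.tolist())
dist.destroy_process_group()
```

The profiler records CPU-side events for Gloo collectives; whether the cross-node communication cost is attributed as cleanly as with NCCL is exactly the open question above.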
-
### 🐛 Describe the bug
When I upgrade to PyTorch 2.2 via Pip, importing torch fails with an undefined symbol error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
File "/scratc…