-
I am running TensorBoard on distributed training logs. I can see operations on the different parameter servers, and they are color-coded by the device placement toggle. But I can’t see operations r…
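For context, a minimal sketch (assuming the TF1-style graph API, since parameter servers are mentioned; the job/task names are illustrative) of how device placement ends up in the graph that TensorBoard colors:

```python
# Minimal sketch: explicit device placement in a TF1-style graph.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

with tf.device("/job:ps/task:0"):
    w = tf.get_variable("w", shape=[10, 10])
with tf.device("/job:worker/task:0"):
    x = tf.placeholder(tf.float32, [None, 10], name="x")
    y = tf.matmul(x, w, name="y")

# Writing the graph lets TensorBoard's device color-by option show
# which ops were placed on which job/task.
writer = tf.summary.FileWriter("logs", tf.get_default_graph())
writer.close()
```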
-
I happened to find that the released training code is much slower than the original (internal) implementation when training on 8 GPUs. Single-GPU training does not seem to suffer from this.…
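A hedged diagnostic sketch (not from the reporter's code): timing the data-loading and compute phases separately is a common first step to see whether a multi-GPU slowdown is input-bound or communication-bound.

```python
import time
import torch

def timed_epoch(loader, step_fn, device="cuda"):
    """Accumulate time spent waiting on data vs. running the step."""
    data_t, step_t = 0.0, 0.0
    end = time.perf_counter()
    for batch in loader:
        data_t += time.perf_counter() - end
        t0 = time.perf_counter()
        step_fn(batch)
        torch.cuda.synchronize(device)  # wait for queued GPU work
        step_t += time.perf_counter() - t0
        end = time.perf_counter()
    print(f"data: {data_t:.1f}s, compute+comm: {step_t:.1f}s")
```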
-
## 🐛 Bug
This issue is related to #42107: [torch.distributed.launch: despite errors, training continues on some GPUs without printing any logs](https://github.com/pytorch/pytorch/issues/42107), whi…
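A hypothetical repro sketch of the linked behavior: if one rank fails before a collective, the surviving ranks can hang silently rather than exit. Nothing here is from the linked issue's actual script.

```python
# Run with, e.g.:
#   python -m torch.distributed.launch --use_env --nproc_per_node=2 repro.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")  # launcher sets MASTER_ADDR/PORT, RANK, WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    if dist.get_rank() == 0:
        raise RuntimeError("simulated failure on rank 0")
    # The other rank blocks here on a collective that can never complete,
    # producing exactly the "continues without printing any logs" symptom.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)

if __name__ == "__main__":
    main()
```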
-
Hello,
I came across your work and was wondering whether loading and training models on multiple GPUs is possible.
I saw in the YOLOv7 repo that it is possible with the following command line…
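As a minimal sketch (not the repo's actual API; the model here is a placeholder module), multi-GPU training in PyTorch usually comes down to wrapping the model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)           # placeholder for the real model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # single-process data parallelism
model = model.cuda()
# DistributedDataParallel (launched via torch.distributed.launch or torchrun)
# is the faster multi-process alternative.
```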
-
Hello, when I set samples_per_gpu to 2 in /projects/configs/surroundocc/surroundocc.py, the following error is raised:
RuntimeError: stack expects each tensor to be equal size, but got [62812, 4] at…
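A sketch of why samples_per_gpu=2 can trigger this: the default collate stacks per-sample tensors, and point clouds differ in length from frame to frame. The second tensor size below is a made-up example.

```python
import torch

a = torch.zeros(62812, 4)  # points in one sample (size from the error above)
b = torch.zeros(60000, 4)  # hypothetical second sample with fewer points
try:
    torch.stack([a, b])    # fails: stack requires identical shapes
except RuntimeError as e:
    print(e)

# Keeping variable-length tensors in a list sidesteps the stack:
def collate_varlen(batch):
    return list(batch)
```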
-
First, thanks for the excellent work! I have been using `torch.distributed.launch` to launch training on a two-node cluster, each node with 8 GPUs, and I found that training is extremely slow (~7x slower than…
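A hedged first diagnostic (an assumed setup, not the reporter's script): timing a large all_reduce across both nodes shows whether the inter-node link, rather than compute, explains the slowdown.

```python
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

t = torch.randn(256 * 1024 * 1024 // 4, device=f"cuda:{local_rank}")  # 256 MB
torch.cuda.synchronize()
start = time.perf_counter()
dist.all_reduce(t)
torch.cuda.synchronize()
if dist.get_rank() == 0:
    print(f"all_reduce of 256 MB took {time.perf_counter() - start:.3f}s")
```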
-
## 🐛 Bug
Returning None from training_step during multi-GPU DDP training freezes training without raising an exception.
### To Reproduce
Starting multi-GPU training with a None-returning training_step fu…
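A hypothetical repro sketch (the module, data, and skip condition are placeholders, not the reporter's code): returning None tells Lightning to skip the step, so no backward pass and no DDP gradient sync happen on that step.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class NoneStep(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        if batch_idx % 2 == 0:
            return None  # skipped step: no backward, no grad all_reduce
        return F.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# Launched with something like:
#   pl.Trainer(accelerator="gpu", devices=2, strategy="ddp").fit(NoneStep(), loader)
# this reportedly freezes instead of raising.
```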
-
**Describe the bug**
I used the Transformers library with DeepSpeed, and used LoRA to fine-tune the CogVLM2 model, which has 19B parameters. During training, I used five graphics cards…
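A hedged sketch of a LoRA setup with peft (the rank/alpha values and target module names are illustrative assumptions, not the reporter's configuration; the checkpoint id is the public CogVLM2 chat model):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm2-llama3-chat-19B", trust_remote_code=True
)
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # LoRA trains a tiny fraction of the 19B
```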
-
### 🐛 Describe the bug
Hi,
I'm running distributed-training experiments with torch (related to [this](https://github.com/pytorch/pytorch/issues/120428)). I found that when I'm training my model…
-
Hello,
Is it possible to run one instance of Stable Diffusion and connect multiple computers to increase the overall GPU capacity and compute power?
I have 10 computers in total.
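A sketch under stated assumptions: a single Stable Diffusion instance cannot span machines, but the workload shards naturally; each computer runs its own pipeline on a different slice of prompts. The node id, prompt list, and output names below are all placeholders.

```python
import torch
from diffusers import StableDiffusionPipeline

NODE_ID, NUM_NODES = 0, 10  # set per machine, e.g. from an env var
prompts = [f"prompt {i}" for i in range(100)]  # placeholder workload

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for i, p in enumerate(prompts):
    if i % NUM_NODES != NODE_ID:
        continue  # another machine handles this prompt
    image = pipe(p).images[0]
    image.save(f"out_{i}.png")
```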