-
First, thanks for the excellent work! I have been using `torch.distributed.launch` to launch training on a two-node cluster with 8 GPUs per node, and I found that training is extremely slow (~7x slower than…
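For reference, a minimal sketch of the two-node setup being described; the script name, addresses, and port below are assumptions and should be adapted to the actual cluster:

```python
# Minimal sketch of a two-node, 8-GPU-per-node run driven by torch.distributed.launch.
#
# Node 0: python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 \
#             --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train.py
# Node 1: same command with --node_rank=1
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")  # NCCL backend for GPU collectives

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[args.local_rank])
# ... training loop goes here; slow inter-node bandwidth or NCCL picking a
# slow network interface is a common cause of the kind of slowdown reported above.
```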
-
## 🐛 Bug
Returning None from training_step with multi-GPU DDP training freezes the training without raising an exception
### To Reproduce
Starting multi-GPU training with a None-returning training_step fu…
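A minimal reproduction sketch, assuming a recent PyTorch Lightning API; the module and skip condition are illustrative, not the original reporter's code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class NoneReturningModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        # Returning None (e.g. to skip a batch) is the behavior reported above
        # to freeze multi-GPU DDP training instead of raising an exception.
        if batch_idx % 2 == 0:
            return None
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=1)
    trainer.fit(NoneReturningModel(), DataLoader(dataset, batch_size=16))
```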
-
Getting this error while pretraining LLama2 on an A100 GPU. Using NCCL version 2.19.3. Running it on a single VM with a single A100 GPU.
Spotllm:73025:73025 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4
…
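For context, bootstrap lines like the one above are printed when NCCL debug logging is enabled. A minimal sketch of turning it on and confirming the NCCL version (the master address/port defaults are placeholders for a single-process run, normally supplied by torchrun):

```python
import os

# Standard NCCL environment variable for verbose logging; must be set before
# the process group is created so the bootstrap/transport lines are emitted.
os.environ["NCCL_DEBUG"] = "INFO"

# Single-VM, single-process defaults (placeholders; a launcher would set these).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
print(torch.cuda.nccl.version())  # should report the NCCL build in use (2.19.3 above)
dist.destroy_process_group()
```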
-
### 🐛 Describe the bug
Hi,
I'm running experiments with distributed training in torch (related to [this](https://github.com/pytorch/pytorch/issues/120428)). I found that when I'm training my model…
-
Hi, and thanks for your great work and for making the code public!
I tried to run fully supervised training on a single GPU, but unfortunately the validation values are very low and both training …
-
Hello, there was an issue during training. Is this a data-reading issue? Thanks!
The error is as follows:
upr-base => val step: 1: 104/119; time: 0.00+0.27
upr-base => val step: 1: 105/119; ti…
-
SparkNet + TensorFrame with the JavaCPP presets focus on data-parallel model training, but to run model-parallel training from Spark/Scala we need the TensorFlow distributed_runtime to be exposed as…
-
I happened to find that the released training code seems to be much slower than the original (internal) implementation when training on 8 GPUs. Single-GPU training does not seem to suffer from this.…
-
This method is crucial in distributed training, yet I found the name very confusing. In the manual, the only reference to it seems to be:
``You then set the epoch length explicitly with the …
-
See more details: https://github.com/pytorch/pytorch/issues/38174
cc @borda @tchaton @rohitgr7 @akihironitta @awaelchli