ghost opened this issue 4 years ago
Hi @kalyanks0611
I did some preliminary experiments with wrapping the model in DataParallel and training on two GPUs.
However, the speed was worse compared to training on a single GPU. So I didn't follow up on this.
If someone gets this working (+ speedup compared to training on one GPU), I would be happy if the code could be shared here.
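For reference, the kind of wrapping being described looks roughly like this (a toy model and toy batch purely to illustrate; this is not the actual SentenceTransformer training code):

```python
# Minimal nn.DataParallel sketch. On a machine with several GPUs,
# DataParallel scatters each batch across the devices, runs the forward
# pass in parallel, and gathers the outputs on GPU 0. On a CPU-only or
# single-GPU machine it degrades to a plain forward pass, so it runs anywhere.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)  # replicate across all visible GPUs

inputs = torch.randn(32, 16, device=device)
outputs = model(inputs)  # with DataParallel, the batch of 32 is split across GPUs
print(outputs.shape)     # torch.Size([32, 4])
```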
In general, training a model on multiple GPUs should be much faster. Any thoughts on why the speed was worse compared to training on a single GPU? @nreimers
Hi @kalyanks0611 A challenge when training on multiple GPUs is the communication overhead between them. Sending data from one GPU to the other is often quite slow, and after each gradient step the gradients are synced between the GPUs. This drastically decreases performance.
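To make the sync step concrete, here is a toy CPU-only illustration (hypothetical replica setup, not library code) of what the post-backward gradient averaging computes: two replicas with identical parameters each process half the batch, and the average of their gradients equals the gradient a single device would compute on the full batch.

```python
import torch

torch.manual_seed(0)
full_x = torch.randn(8, 4)
full_y = torch.randn(8, 1)

def grad_on(x, y):
    model = torch.nn.Linear(4, 1)
    # give every "replica" identical starting parameters, as DDP does
    with torch.no_grad():
        model.weight.fill_(0.5)
        model.bias.fill_(0.0)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return model.weight.grad.clone()

g_full = grad_on(full_x, full_y)      # one GPU, full batch
g0 = grad_on(full_x[:4], full_y[:4])  # "GPU 0", first half of the batch
g1 = grad_on(full_x[4:], full_y[4:])  # "GPU 1", second half of the batch
g_synced = (g0 + g1) / 2              # the all-reduce average

print(torch.allclose(g_full, g_synced, atol=1e-6))  # True
```

The sync itself is cheap arithmetic; the cost that hurts multi-GPU throughput is moving those gradient tensors between devices after every step.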
At least in 2017, Pytorch DataParallel was not really efficient: https://github.com/pytorch/fairseq/issues/34
I don't know if this has improved since then. As mentioned, on the servers I tested, I saw a significant speed drop. Maybe this has changed with more recent versions of Pytorch / Transformers.
What about using DistributedDataParallel?
DistributedDataParallel is meant for multiple servers. I haven't tested it, but there the communication overhead is even larger.
In fact, DDP can also be used on one machine, and as stated in the following tutorial, DDP is faster than DataParallel even on a single node: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
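For anyone landing here, a minimal single-node DDP sketch (toy model and data, not the SentenceTransformer code). In real training you would launch one process per GPU, e.g. `torchrun --nproc_per_node=2 train.py`, read the rank from the environment, and use the `nccl` backend; here the world size is 1 on CPU with `gloo` so the sketch runs anywhere:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(rank: int = 0, world_size: int = 1, port: int = 29500) -> float:
    # Under torchrun, rank and world_size come from environment variables.
    dist.init_process_group(
        backend="gloo",  # use "nccl" for real multi-GPU training
        init_method=f"tcp://127.0.0.1:{port}",
        rank=rank,
        world_size=world_size,
    )
    model = torch.nn.Linear(8, 2)
    ddp_model = DDP(model)  # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    inputs = torch.randn(4, 8)
    labels = torch.randint(0, 2, (4,))
    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), labels)
    loss.backward()  # gradient sync happens here, overlapped with compute
    optimizer.step()
    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    print(f"loss after one step: {train_step():.4f}")
```

Unlike DataParallel's single-process scatter/gather, each DDP process owns its GPU and only exchanges gradients, which is why it tends to scale better.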
Hi @zhangdan8962 That is interesting. I will have a look
To overcome the issue in DataParallel, there is a PyTorch package called PyTorch-Encoding.
```python
from parallel import DataParallelModel, DataParallelCriterion

parallel_model = DataParallelModel(model)             # Encapsulate the model
parallel_loss = DataParallelCriterion(loss_function)  # Encapsulate the loss function

predictions = parallel_model(inputs)       # Parallel forward pass
# "predictions" is a tuple of n_gpu tensors
loss = parallel_loss(predictions, labels)  # Compute loss function in parallel
loss.backward()                            # Backward pass
optimizer.step()                           # Optimizer step
predictions = parallel_model(inputs)       # Parallel forward pass with new parameters
```
(this code is taken from https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255)
@nreimers
A simple implementation: https://github.com/liuyukid/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py. I don't know if the speed can be improved, but it at least supports a larger batch_size. You can try it!
Hi, has anyone had success parallelizing SentenceTransformer training across multiple GPUs using the PyTorch-Encoding approach that @kalyanks0611 brought up two comments above?
Hey, +1 on the above comment. Any update on multi-GPU training?
Hey @challos, I was able to make it work using a pretty ancient version of sentence-transformers (0.38, because I had to). I think that if you can use the up-to-date version, it has some native multi-GPU support. If not, I found this article from one of the Hugging Face folks instrumental. He refers to a piece of code from zhanghang1989 (on GitHub), which I was able to use almost verbatim (I think there was a small bug there for my use case, but it is mostly usable as is; if you see a crash you'll know how to fix it).
Get through the explanation in that article: it is somewhat dense, but useful in the end, and the code does just that.
Do we have any update on Multi GPU Training?
Any update on this? thanks
> A simple implementation: https://github.com/liuyukid/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py. I don't know if the speed can be improved, but it at least supports a larger batch_size. You can try it!
I tried this code to train on 1 worker with 4 GPUs. It is not faster, about the same speed as 1 worker with 1 GPU. Does anybody have good ideas?
I cannot find a solution.
Got the same result here with 4 GPUs: no acceleration (only the batch size increased by 4x).
Hi, will you implement multi-GPU code? As computing resources improve, people are no longer satisfied with using 2 GPUs, but use more.
Hello @zhanxlin,
Multi-GPU support is being introduced in the upcoming v3.0 release of Sentence Transformers (planned in a few weeks). See the v3.0-pre-release branch for the code, in case you already want to play around with it. I think the following should work:
```
pip install git+https://github.com/UKPLab/sentence-transformers@v3.0-pre-release
```
There are some details in #2449 about how the training will be changed and how to use multi-GPU training. But to give you a sneak peek on the latter: you launch your training script with `torchrun` or `accelerate` instead of `python`. As you can imagine, this results in very notable training speedups.
Hi @tomaarsen Any idea when the exact release date is?
Hello @bely66,
I'm preparing for the release to be this week. I can't promise an exact date as there might be some unexpected issues.
I have trained an SBERT model from scratch on a single GPU using the code from https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_nli.py and https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_stsbenchmark_continue_training.py.
Now I would like to train the model from scratch using two GPUs. I'm not sure about the changes I have to make in the above code so that I can train the model using two GPUs.
@nreimers