ghost opened this issue 4 years ago
Hi @kalyanks0611
I did some preliminary experiments with wrapping the model in DataParallel and training on two GPUs.
However, the speed was worse compared to training on a single GPU. So I didn't follow up on this.
If someone gets this working (+ speedup compared to training on one GPU), I would be happy if the code could be shared here.
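For reference, the kind of wrapping being described looks roughly like this (a toy model and toy batch purely to illustrate; this is not the actual SentenceTransformer training code):

```python
# Minimal nn.DataParallel sketch. On a machine with several GPUs,
# DataParallel scatters each batch across the devices, runs the forward
# pass in parallel, and gathers the outputs on GPU 0. On a CPU-only or
# single-GPU machine it degrades to a plain forward pass, so it runs anywhere.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)  # replicate across all visible GPUs

inputs = torch.randn(32, 16, device=device)
outputs = model(inputs)  # with DataParallel, the batch of 32 is split across GPUs
print(outputs.shape)     # torch.Size([32, 4])
```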
In general, training a model on multiple GPUs should be much faster. Any thoughts on why the speed was worse compared to training on a single GPU? @nreimers
Hi @kalyanks0611 A challenge when training on multiple GPUs is the communication overhead between them. Sending data from one GPU to the other is often quite slow, and after each gradient step the gradients are synced between the GPUs. This drastically decreases performance.
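To make the sync step concrete, here is a toy CPU-only illustration (hypothetical replica setup, not library code) of what the post-backward gradient averaging computes: two replicas with identical parameters each process half the batch, and the average of their gradients equals the gradient a single device would compute on the full batch.

```python
import torch

torch.manual_seed(0)
full_x = torch.randn(8, 4)
full_y = torch.randn(8, 1)

def grad_on(x, y):
    model = torch.nn.Linear(4, 1)
    # give every "replica" identical starting parameters, as DDP does
    with torch.no_grad():
        model.weight.fill_(0.5)
        model.bias.fill_(0.0)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return model.weight.grad.clone()

g_full = grad_on(full_x, full_y)      # one GPU, full batch
g0 = grad_on(full_x[:4], full_y[:4])  # "GPU 0", first half of the batch
g1 = grad_on(full_x[4:], full_y[4:])  # "GPU 1", second half of the batch
g_synced = (g0 + g1) / 2              # the all-reduce average

print(torch.allclose(g_full, g_synced, atol=1e-6))  # True
```

The sync itself is cheap arithmetic; the cost that hurts multi-GPU throughput is moving those gradient tensors between devices after every step.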
At least in 2017, Pytorch DataParallel was not really efficient: https://github.com/pytorch/fairseq/issues/34
I don't know if this has improved since then. As mentioned, on the servers I tested, I saw a significant speed drop. Maybe this has changed with more recent versions of Pytorch / Transformers.
What about using DistributedDataParallel?
DistributedDataParallel is meant for multiple servers. I haven't tested it, but there the communication overhead is even larger.
In fact, DDP can also be used on one machine, and as stated in the following tutorial, DDP is faster than DataParallel even on a single node: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
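For anyone landing here, a minimal single-node DDP sketch (toy model and data, not the SentenceTransformer code). In real training you would launch one process per GPU, e.g. `torchrun --nproc_per_node=2 train.py`, read the rank from the environment, and use the `nccl` backend; here the world size is 1 on CPU with `gloo` so the sketch runs anywhere:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(rank: int = 0, world_size: int = 1, port: int = 29500) -> float:
    # Under torchrun, rank and world_size come from environment variables.
    dist.init_process_group(
        backend="gloo",  # use "nccl" for real multi-GPU training
        init_method=f"tcp://127.0.0.1:{port}",
        rank=rank,
        world_size=world_size,
    )
    model = torch.nn.Linear(8, 2)
    ddp_model = DDP(model)  # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    inputs = torch.randn(4, 8)
    labels = torch.randint(0, 2, (4,))
    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), labels)
    loss.backward()  # gradient sync happens here, overlapped with compute
    optimizer.step()
    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    print(f"loss after one step: {train_step():.4f}")
```

Unlike DataParallel's single-process scatter/gather, each DDP process owns its GPU and only exchanges gradients, which is why it tends to scale better.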
Hi @zhangdan8962 That is interesting. I will have a look
To overcome the issue in DataParallel, there is a PyTorch package called PyTorch-Encoding.
```python
from parallel import DataParallelModel, DataParallelCriterion

parallel_model = DataParallelModel(model)             # Encapsulate the model
parallel_loss = DataParallelCriterion(loss_function)  # Encapsulate the loss function

predictions = parallel_model(inputs)       # Parallel forward pass
# "predictions" is a tuple of n_gpu tensors
loss = parallel_loss(predictions, labels)  # Compute loss function in parallel
loss.backward()                            # Backward pass
optimizer.step()                           # Optimizer step
predictions = parallel_model(inputs)       # Parallel forward pass with new parameters
```
(this code is taken from https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255)
@nreimers
A simple implementation: https://github.com/liuyukid/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py. I don't know if the speed can be improved, but it at least supports a larger batch_size. You can try it!
Hi, has anyone had success parallelizing SentenceTransformer training across multiple GPUs using the PyTorch-Encoding approach that @kalyanks0611 brought up two comments above?
Hey, +1 on the above comment. Any update on multi-GPU training?
Hey @challos, I was able to make it work using a pretty ancient version of sentence-transformers (0.38, because I had to). I think that if you can use the up-to-date version, it has some native multi-GPU support. If not, I found this article from one of the Hugging Face folks instrumental. He refers to a piece of code from zhanghang1989 (on GitHub), which I was able to use almost verbatim (I think there was a small bug there for my use case, but it is mostly usable as is; if you see a crash you'll know how to fix it).
Get through the explanation in that article: it is somewhat dense, but useful in the end, and the code does just that.
Do we have any update on Multi GPU Training?
Any update on this? thanks
> A simple implementation: https://github.com/liuyukid/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py. I don't know if the speed can be improved, but it at least supports a larger batch_size. You can try it!
I tried this code to train on 1 worker with 4 GPUs. It is not faster, about the same speed as 1 worker with 1 GPU. Does anybody have good ideas?
I cannot find a solution.
Got the same result here with 4 GPUs: no acceleration (only the batch size increased by 4x).
Hi, will you implement multi-GPU code? As computing resources improve, people are no longer satisfied with using 2 GPUs, but use more.
Hello @zhanxlin,
Multi-GPU support is being introduced in the upcoming v3.0 release of Sentence Transformers (planned in a few weeks). See the v3.0-pre-release branch for the code, in case you already want to play around with it. I think the following should work:
```
pip install git+https://github.com/UKPLab/sentence-transformers@v3.0-pre-release
```
There are some details in #2449 about how the training will be changed and how to use multi-GPU training. But to give you a sneak peek on the latter: you launch your training script with `torchrun` or `accelerate` instead of `python`. As you can imagine, this results in very notable training speedups.
Hi @tomaarsen Any idea when the exact release date is?
Hello @bely66,
I'm preparing for the release to be this week. I can't promise an exact date as there might be some unexpected issues.
I have trained an SBERT model from scratch on a single GPU using the code from https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_nli.py and https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_stsbenchmark_continue_training.py.
Now I would like to train the model from scratch using two GPUs. I'm not sure about the changes I have to make in the above code so that I can train the model using two GPUs.
@nreimers