UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

No speedup when batching Sentence Transformer Bert #609

Open PetreanuAndi opened 3 years ago

PetreanuAndi commented 3 years ago

Hey guys. I get no benefit from batching (no speedup whatsoever) with Sentence-Transformer. I would love your opinion on the following situation:

I run inference on the 'bert-base-nli-mean-tokens' model with fake input, for the sake of benchmarking: 'text ' * 512, i.e. 'text text text ... text', so that every batch element always has exactly 512 tokens.

If I run with a batch size of 128, I get an inference time of about 1.82 seconds (on average) for the whole batch. I'm timing only the model.encode method.

If I run with a batch size of 32, I get about 0.46 seconds (on average) for the whole batch. Now, 0.46 * 4 = 1.84, so batching does nothing to help me with optimisation / parallelisation of inference.

I was expecting better inference speed with larger batches. Locally, I am on a 2080 Ti GPU machine, with "544 tensor cores" as per the NVIDIA documentation. I have replicated the exact same behavior on a K80 device in the cloud (much lower inference speed overall, but the same no-speedup effect with larger batches).

Why is Sentence-Transformer behaving this way? Thank you, I would really appreciate some insights.
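
For reference, a minimal sketch of this kind of benchmark (the corpus size, timing loop and show_progress_bar flag are illustrative assumptions, not the original script):

import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

# Fake input: repeating one word so every sequence hits the 512-token limit.
sentences = ['text ' * 512] * 1024

for batch_size in (32, 128):
    start = time.time()
    model.encode(sentences, batch_size=batch_size, show_progress_bar=False)
    elapsed = time.time() - start
    n_batches = len(sentences) / batch_size
    print(f'batch_size={batch_size}: {elapsed:.2f}s total, '
          f'{elapsed / n_batches:.3f}s per batch')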

nreimers commented 3 years ago

Increasing the batch size only helps when your GPU is under-utilized. If your GPU is at 100% with a batch size of 32, you will not see an improvement when going to a batch size of 128.

In contrast, if you have a larger batch, more padding is needed to bring all sequences to the same length, and more GPU compute time is wasted on these padding tokens.

So depending on the model, you often get optimal speed with batch sizes between 16 and 128. For really small models with e.g. only 2 layers, larger batches might be helpful (there, the Python-based tokenizer can be a bottleneck).
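
One quick way to check the under-utilization point above is to watch GPU utilization while encode runs. A sketch (assumes nvidia-smi is on the PATH; the model name and corpus are placeholders):

import subprocess
import threading
import time

from sentence_transformers import SentenceTransformer

def poll_gpu(stop_event, interval=1.0):
    # Print GPU utilization and memory once per second while encoding runs.
    while not stop_event.is_set():
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
             '--format=csv,noheader'],
            capture_output=True, text=True)
        print('GPU:', result.stdout.strip())
        time.sleep(interval)

model = SentenceTransformer('bert-base-nli-mean-tokens')
sentences = ['text ' * 512] * 512

stop = threading.Event()
threading.Thread(target=poll_gpu, args=(stop,), daemon=True).start()
model.encode(sentences, batch_size=32)  # if utilization is already ~100%, a larger batch cannot help much
stop.set()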

PetreanuAndi commented 3 years ago

Hello @nreimers, and thanks for your input. As I was saying, all my inputs have a fixed 512 tokens, the same as the maximum network input size, so no padding is necessary. When running with a batch size of 32, my 2080 Ti GPU really is under-utilized; that's why I was expecting a performance boost when moving to a batch size of 128.

The model is "bert-base-nli-mean-tokens". S-BERT. Off-the-shelf with pretrained weights.

My observation is that there is no difference in speed (one batch of 128 vs. 4 batches of 32). I have experimentally searched for the optimal batch size you are talking about, but had no luck finding it.

I'm pretty sure this is related to some implementation detail of the sentence-transformers encode call. Does the model perform inference sequentially over all elements in the batch? It actually seems like it does :) I will investigate the code.

Any further insights are welcome.
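
For what it is worth, encode does not loop over single sentences: each batch is tokenized into one padded tensor and pushed through the transformer in a single forward pass. A simplified sketch of that per-batch step (not the library's actual code; assumes a recent sentence-transformers version where model.tokenize returns a dict of tensors, and that the model is already on the target device):

import torch

def encode_batch(model, batch_sentences, device='cuda'):
    # Tokenize the whole batch into one padded input_ids / attention_mask dict.
    features = model.tokenize(batch_sentences)
    features = {key: value.to(device) for key, value in features.items()}
    with torch.no_grad():
        # One forward pass over the entire batch, not a per-sentence loop.
        out = model(features)
    return out['sentence_embedding'].cpu()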

nreimers commented 3 years ago

@PetreanuAndi The Python tokenizer can be quite a bottleneck. You can try the fast tokenizers from Hugging Face. Just call:

from transformers import AutoTokenizer

model.tokenizer = AutoTokenizer.from_pretrained('huggingface_model_name_or_path', use_fast=True)
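
To check whether tokenization is actually the bottleneck, a rough timing comparison of the slow vs. fast tokenizer on inputs like the ones above (the model name is just an example):

import time
from transformers import AutoTokenizer

sentences = ['text ' * 512] * 1000

for use_fast in (False, True):
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=use_fast)
    start = time.time()
    tokenizer(sentences, padding=True, truncation=True, max_length=512)
    print(f'use_fast={use_fast}: {time.time() - start:.2f}s')
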
HodorTheCoder commented 3 years ago

@PetreanuAndi did you ever figure this out? I'm encoding massive quantities (about 800k sentences) and I was hoping to get a nice boost going from a batch size of 32 to 128, but no such luck.

I haven't looked under the hood yet, but I know that with Hugging Face transformer models in native PyTorch, for example a bert2bert encoder/decoder model, increasing the batch size for inference gives a noticeable speed boost, and I was hoping it would be the same here. I notice that increasing the batch size roughly quadruples the amount of memory used on the Tesla T4, but GPU utilization stays the same at roughly 96%.

Did you have any luck with the fast tokenizer?
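
As an aside, for corpora of this size the library also offers multi-process encoding, which helps when more than one GPU (or several CPU workers) is available, rather than when a single GPU is already saturated. A minimal sketch, with a hypothetical sentence loader and an assumed two-GPU machine:

from sentence_transformers import SentenceTransformer

if __name__ == '__main__':
    model = SentenceTransformer('bert-base-nli-mean-tokens')
    sentences = load_sentences()  # hypothetical loader for the ~800k sentences

    # Spread encoding over several devices; adjust target_devices to your machine.
    pool = model.start_multi_process_pool(target_devices=['cuda:0', 'cuda:1'])
    embeddings = model.encode_multi_process(sentences, pool, batch_size=32)
    model.stop_multi_process_pool(pool)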

nreimers commented 3 years ago

Hi, increasing your batch size only helps when your GPU utilization is low. At 96% utilization there is not much improvement left to be gained.

Increasing the batch size in that case actually decreases performance, as more padding is needed to make the batches uniform in length.

The fast tokenizer is used automatically with transformers v4.
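
A quick way to confirm which tokenizer a loaded model ended up with (is_fast is a standard attribute on Hugging Face tokenizers):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
print(type(model.tokenizer).__name__, model.tokenizer.is_fast)  # e.g. BertTokenizerFast True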

qrdlgit commented 1 year ago

@nreimers Might be a good idea to mention this in the docs.