Closed: renaud closed this issue 4 years ago.
Hey @renaud, we are currently doing a lot of inference benchmarking for Question Answering, as described in deepset-ai/haystack/issues/39, where we also compare PyTorch vs. ONNX.
Concerning your throughput: I think it is pretty slow, but I am not sure how a T4 GPU performs compared to the V100s we used. One important parameter is the batch size. Did you test different batch size values?
Looking at Tanay's post, it takes 0.1621 seconds for a batch of size 64 to complete on a V100. That works out to about 395 samples per second. And this is for QA, where inference is much more complex (a lot of communication between GPU and CPU). Simple text classification should be faster; my intuitive guess would be by a factor of 2-5x.
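For reference, throughput follows directly from the batch size and the measured batch latency. A quick sanity check of the numbers above (purely illustrative):

```python
def samples_per_second(batch_size: int, batch_latency_s: float) -> float:
    """Throughput in samples/sec from one batch's wall-clock latency."""
    return batch_size / batch_latency_s

# V100 QA numbers from above: a batch of 64 completes in 0.1621 s
print(round(samples_per_second(64, 0.1621)))  # -> 395
```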
Happy to interact here and make text classification inference faster together with you!
I did some speed benchmarking on text classification inference. I used a V100 GPU and tested various batch size and max_seq_len values with inference on 1000 texts:
So dividing 1000/3.57 we get 280 samples/second for seq len=128 and batch size=30. I would suggest you try increasing the batch size. A T4 will still be slower than a V100, but 54 samples/s is really low. I also realized that I might be wrong about text classification inference being faster than QA inference; the numbers are comparable to a recent QA inference benchmark test.
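A minimal timing harness along these lines can reproduce such numbers. This is a sketch, not FARM's actual benchmark code: `predict` is a stand-in for whatever batched model call you are measuring (e.g. a FARM `Inferencer`), and the dummy predictor below only exists to show the bookkeeping:

```python
import time

def benchmark(predict, texts, batch_size):
    """Run batched inference over `texts` and return total wall-clock seconds.

    `predict` is any callable that takes a list of texts; here it is a
    placeholder for the real model call.
    """
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        predict(texts[i:i + batch_size])
    return time.perf_counter() - start

# Dummy predictor, just to exercise the loop:
texts = ["some text"] * 1000
for bs in (1, 3, 6, 10, 20, 30):
    elapsed = benchmark(lambda batch: [0] * len(batch), texts, bs)
    print(f"Batch size {bs}: {elapsed:.3f}s -> {1000 / elapsed:.0f} samples/s")
```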
Ok, why talk about intuition when one can just check?
I tested QA inference vs. text classification:
Text Classification on 1000 docs, max seq len 128:

| Batch size | Time (s) |
|-----------:|---------:|
| 1          | 14.947   |
| 3          | 5.801    |
| 6          | 3.904    |
| 10         | 3.771    |
| 20         | 3.758    |
| 30         | 3.667    |
Question Answering on 1000 questions, max seq len 128 (doc + question just below 128 tokens):

| Batch size | Time (s) |
|-----------:|---------:|
| 1          | 16.096   |
| 3          | 6.172    |
| 6          | 5.290    |
| 10         | 4.951    |
| 20         | 5.044    |
| 30         | 4.930    |
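Converted to throughput (1000 samples divided by the measured time), the gap is easier to read:

```python
# Measured times (seconds) for 1000 samples, keyed by batch size, from above.
tc = {1: 14.947, 3: 5.801, 6: 3.904, 10: 3.771, 20: 3.758, 30: 3.667}
qa = {1: 16.096, 3: 6.172, 6: 5.290, 10: 4.951, 20: 5.044, 30: 4.930}

for bs in tc:
    print(f"Batch size {bs:>2}: "
          f"TC {1000 / tc[bs]:.0f} samples/s, "
          f"QA {1000 / qa[bs]:.0f} samples/s")
```

At batch size 30 this works out to roughly 273 samples/s for text classification vs. about 203 samples/s for QA.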
So QA inference does seem a bit slower than text classification inside FARM (0.4.4).
Closing this now for inactivity; feel free to reopen.
**Question**
I am getting around 54 sentences/s on inference for text classification.
What do you think? Is that good? Does this compare with what you get?
**Additional context**