stephantul opened 3 weeks ago
Batch inferencing causes a big drop in performance. When running:
model.encode(["I love sun bathing, but I always use sun screen!","I love sun bathing, but I always use sun screen!"])
it takes much longer than running this twice:
model.encode(["I love sun bathing, but I always use sun screen!"])
Is there any bottleneck for batch inferencing?
This is my testing code:
# pip install model2vec
from model2vec import StaticModel
# Load a pretrained Model2Vec model, this is the static version of BAAI/bge-base-en-v1.5
model = StaticModel.from_pretrained("minishlab/M2V_base_output")
Running in Jupyter with Python 3.11 on an M3 Max
%%time
for i in range(100_000):
    model.encode(["I love sun bathing, but I always use sun screen!",
                  "I love sun bathing, but I always use sun screen!"])
# CPU times: user 1min 1s, sys: 4min 1s, total: 5min 3s # So much more CPU time here
# Wall time: 21.6 s
%%time
for i in range(100_000):
    model.encode(["I love sun bathing, but I always use sun screen!"])
# CPU times: user 5.28 s, sys: 106 ms, total: 5.39 s
# Wall time: 5.39 s
Hi! Thanks for the comment.
Interesting. Neither @Pringled nor I was able to reproduce the slowdown. I think this is related to the dispatch in the Rust tokenizer: it spawns multiple workers, but because the strings are so short and there are only two of them, dispatching the workers only adds overhead.
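As a rough illustration of that overhead (this is not model2vec's actual Rust dispatch code, just a plain-Python sketch with a made-up `tiny_task` and `ProcessPoolExecutor`):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def tiny_task(s: str) -> int:
    # Stands in for tokenizing one very short string.
    return len(s.split())

def compare(texts):
    """Time serial in-process work vs. dispatching the same work to workers."""
    t0 = time.perf_counter()
    serial = [tiny_task(t) for t in texts]
    serial_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=2) as pool:
        parallel = list(pool.map(tiny_task, texts))
    parallel_s = time.perf_counter() - t0

    assert serial == parallel  # same results, very different cost
    return serial_s, parallel_s

if __name__ == "__main__":
    texts = ["I love sun bathing, but I always use sun screen!"] * 2
    serial_s, parallel_s = compare(texts)
    print(f"serial:   {serial_s * 1e6:10.1f} µs")
    print(f"parallel: {parallel_s * 1e6:10.1f} µs  (worker startup dominates)")
```

For two tiny strings, starting and feeding the workers costs far more than the work itself; only with enough (or long enough) inputs does the dispatch pay for itself.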
Could you run this snippet? It requires the datasets library. The timings in the snippet were taken on a regular M3 MacBook.
from datasets import load_dataset
from model2vec import StaticModel
model = StaticModel.from_pretrained("minishlab/M2V_base_output")
ds = load_dataset("wikimedia/wikipedia", data_files="20231101.en/train-00000-of-00041.parquet")["train"]
text = ds["text"][:5000]
%time _ = [model.encode(x) for x in text]
# CPU times: user 4.14 s, sys: 468 ms, total: 4.61 s
# Wall time: 4.63 s
%time _ = model.encode(text)
# CPU times: user 7.81 s, sys: 513 ms, total: 8.33 s
# Wall time: 935 ms
# If we limit to 2 docs
text = text[:2]
%time _ = [model.encode(x) for x in text]
# CPU times: user 3.55 ms, sys: 3.05 ms, total: 6.6 ms
# Wall time: 5.83 ms
%time _ = model.encode(text)
# CPU times: user 3.13 ms, sys: 427 μs, total: 3.56 ms
# Wall time: 1.8 ms
So, on my machine batching gives a consistent speedup, even with just two texts. If your issue still appears with 100, 1,000, or 10,000 texts, that would of course be very bad. Curious to see what comes out of it.
Thanks, I ran the script:
CPU times: user 3.97 s, sys: 68.4 ms, total: 4.04 s
Wall time: 4.04 s
CPU times: user 13.3 s, sys: 2.19 s, total: 15.5 s
Wall time: 1.18 s
CPU times: user 22.5 ms, sys: 36.5 ms, total: 59 ms
Wall time: 3.78 ms
CPU times: user 6.5 ms, sys: 15.5 ms, total: 22 ms
Wall time: 1.43 ms
I tested, and the issue also persists with only 10 repetitions:
%%time
for i in range(10):
    model.encode(["I love sun bathing, but I always use sun screen!",
                  "I love sun bathing, but I always use sun screen!"])
# CPU times: user 14.9 ms, sys: 40.7 ms, total: 55.6 ms
# Wall time: 4.76 ms
%%time
for i in range(10):
    model.encode(["I love sun bathing, but I always use sun screen!"])
# CPU times: user 2.59 ms, sys: 1.23 ms, total: 3.82 ms
# Wall time: 2.47 ms
For my use case it's not a problem; I will just avoid batch processing for now. However, for new users who might expect batch inferencing to always give a massive speed boost, this could be counterintuitive.
Hey @do-me ,
In your last example, the wall time of the second run is approximately half that of the first one. But you also process half as many examples. If you kept the number of items constant (i.e., you doubled the number of iterations of the second loop), you'd get approximately the same wall time.
To expand a bit more: the metric we're interested in is wall time, not CPU or user time. In fact, wall time is only lower because CPU time is higher: we use batch processing in the Rust-backed tokenizer, which creates separate workers for each text. So each "wall time" second can translate to many CPU seconds, and because the work runs in user space, just as many user seconds.
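To make the wall-vs-CPU distinction concrete, here is a self-contained sketch. `burn` and `timed_batch` are made-up stand-ins for the tokenizer's work; `os.times()` is used because it reports child-process CPU time on Unix:

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor

def burn(n: int) -> int:
    # CPU-bound busywork standing in for per-text tokenization.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_batch(n_tasks=4, work=2_000_000):
    """Return (wall_seconds, worker_cpu_seconds) for one parallel batch."""
    t0_wall = time.perf_counter()
    t0 = os.times()
    with ProcessPoolExecutor(max_workers=n_tasks) as pool:
        list(pool.map(burn, [work] * n_tasks))
    t1 = os.times()
    wall = time.perf_counter() - t0_wall
    # CPU seconds spent inside the (already reaped) worker processes;
    # os.times() only reports child times on Unix.
    cpu = (t1.children_user - t0.children_user) + \
          (t1.children_system - t0.children_system)
    return wall, cpu

if __name__ == "__main__":
    wall, cpu = timed_batch()
    print(f"wall time: {wall:.2f} s, worker CPU time: {cpu:.2f} s")
```

With several workers busy at once, the summed worker CPU time can exceed the wall time, which is exactly the pattern in the timings above.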
So what you are seeing in the data above is exactly what one would expect given that we're multiprocessing. You can repeat the experiments using the %timeit magic to see what I mean.
n = 2
sentence = ["I love sun bathing, but I always use sun screen!"]
# Batch size N
%timeit model.encode(sentence * n)
# 138 µs
# N times batch size 1
%timeit [model.encode(x) for x in (sentence * n)]
# 125 µs
This leads to slightly higher processing times for batch size 2, which is indeed surprising, but it's only 138 vs. 125 microseconds on my machine. The higher the batch size, the more savings you'd get, which you can simulate by increasing n. Also note that this effect completely disappears with longer texts:
n = 2
sentence = ["I love sun bathing, but I always use sun screen! " * 100]
# Batch size N
%timeit model.encode(sentence * n)
# 1.39 ms
# N times batch size 1
%timeit [model.encode(x) for x in (sentence * n)]
# 2.35 ms
So for a longer string, not batching already means a huge drop in performance, even at batch size 2.
I hope this helps! Stéphan
In your last example, the wall time of the second run is approximately half that of the first one. But you also process half as many examples. If you kept the number of items constant (i.e., you doubled the number of iterations of the second loop), you'd get approximately the same wall time.
Sorry for the confusing example, but no, it's not the same for n=2. In fact batch processing takes twice as long for this example:
%%time
for i in range(1000):
    model.encode(["I love sun bathing, but I always use sun screen!",
                  "I love sun bathing, but I always use sun screen!"])
# CPU times: user 830 ms, sys: 3.18 s, total: 4.01 s
# Wall time: 271 ms
%%time
for i in range(2000):
    model.encode(["I love sun bathing, but I always use sun screen!"])
# CPU times: user 150 ms, sys: 3.95 ms, total: 154 ms
# Wall time: 153 ms
However, this apparently applies only to very small batches with n < 4. Already at n = 4, the two are about the same:
n = 4
sentence = ["I love sun bathing, but I always use sun screen!"]
# Batch size N
%timeit model.encode(sentence * n)
# N times batch size 1
%timeit [model.encode(x) for x in (sentence * n)]
# 294 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 294 µs ± 5.81 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
For n = 100, it behaves as expected, with huge gains for batching:
# 1.04 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 7.77 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So my tl;dr: if the batch size is below 4 and you work with short texts, don't batch (at least on my system).
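That rule of thumb could be wrapped up as a small helper. This is just a sketch, not part of the model2vec API: `encode_smart` and the threshold of 4 come from the measurements above, and `DummyModel` is a hypothetical stand-in so the example is self-contained (with model2vec installed you would pass the real `StaticModel`):

```python
import numpy as np

class DummyModel:
    """Hypothetical stand-in for model2vec's StaticModel, for illustration only."""
    def encode(self, texts):
        # A single string maps to one 4-dim vector; a list maps to a 2-D array.
        if isinstance(texts, str):
            return np.full(4, float(len(texts)))
        return np.stack([self.encode(t) for t in texts])

def encode_smart(model, texts, threshold=4):
    """Encode per text below the threshold to avoid worker-dispatch overhead."""
    if len(texts) < threshold:
        return np.stack([model.encode(t) for t in texts])
    return model.encode(texts)

model = DummyModel()
small = encode_smart(model, ["a", "bb"])   # takes the per-text path
large = encode_smart(model, ["a"] * 8)     # takes the batched path
```

Both paths return an array of shape (number of texts, embedding dim), so callers don't have to care which branch was taken.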
Heya @do-me ,
I looked into this a bit as well, since it's unexpected that batches smaller than 4 would be faster without batching. I think the issue might be something else (e.g. the initialisation of the lists inside the loop). Could you run the following code on your system?
import timeit
import numpy as np

def measure_time(stmt, globals, loops, runs):
    times = timeit.repeat(stmt, globals=globals, repeat=runs, number=loops)
    mean_time = np.mean(times) / loops
    std_dev_time = np.std(times) / loops
    return mean_time, std_dev_time

# Number of runs and repetitions
runs = 7
loops = 1000

sentence = ["I love sun bathing, but I always use sun screen!"]

for n in range(1, 10):
    # Pre-create the input data outside the loops
    batch_sentence = sentence * n
    individual_sentences = [sentence[0] for _ in range(n)]
    print(f"n = {n}")

    # Batch size N
    mean_batch, std_batch = measure_time("model.encode(batch_sentence)", globals(), loops, runs)
    print(f"Batch size N: {mean_batch * 1e6:.2f} µs ± {std_batch * 1e6:.2f} µs per loop (mean ± std. dev. of {runs} runs, {loops} loops each)")

    # N times batch size 1
    mean_individual, std_individual = measure_time("[model.encode(x) for x in individual_sentences]", globals(), loops, runs)
    print(f"N times batch size 1: {mean_individual * 1e6:.2f} µs ± {std_individual * 1e6:.2f} µs per loop (mean ± std. dev. of {runs} runs, {loops} loops each)\n")
For me, this gives the following output:
n = 1
Batch size N: 65.53 µs ± 8.80 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 66.99 µs ± 10.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 2
Batch size N: 152.86 µs ± 18.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 130.02 µs ± 5.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 3
Batch size N: 165.88 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 185.85 µs ± 6.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 4
Batch size N: 157.36 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 253.65 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 5
Batch size N: 161.62 µs ± 4.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 317.99 µs ± 8.00 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 6
Batch size N: 199.23 µs ± 11.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 381.24 µs ± 7.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 7
Batch size N: 194.28 µs ± 11.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 475.11 µs ± 17.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 8
Batch size N: 203.56 µs ± 18.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 548.64 µs ± 17.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 9
Batch size N: 199.27 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 579.28 µs ± 13.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So only for n = 2 is there a slight improvement when not batching, which I think is related to what @stephantul mentioned above w.r.t. multiprocessing.
The effect persists; 4-5 is still the magic number on my system, don't ask me why :D It seems there is some "base cost" in initializing the tokenizer (or similar) that is only outweighed for n > 4.
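One way to probe that "base cost" hypothesis is to time encoding at several batch sizes and fit a line t(n) ≈ fixed + per_item · n: the intercept estimates the per-call overhead, the slope the per-text cost. `fit_overhead` and `toy_encode` below are hypothetical helpers; with model2vec installed you would pass `model.encode` instead of the toy:

```python
import timeit
import numpy as np

def fit_overhead(encode, sizes=(1, 2, 4, 8, 16), loops=200):
    """Fit t(n) ~ fixed + per_item * n over several batch sizes."""
    times = []
    for n in sizes:
        batch = ["I love sun bathing, but I always use sun screen!"] * n
        # Best-of-3 timing to suppress noise, normalized to seconds per call.
        t = min(timeit.repeat(lambda: encode(batch), repeat=3, number=loops)) / loops
        times.append(t)
    per_item, fixed = np.polyfit(sizes, times, 1)  # slope, intercept
    return fixed, per_item

def toy_encode(batch):
    """Toy stand-in: a constant setup cost plus linear per-item work."""
    acc = [0.0] * 64          # fixed "setup" allocation
    for s in batch:
        acc[0] += len(s)      # per-item work
    return acc

fixed, per_item = fit_overhead(toy_encode)
print(f"fixed cost ~ {fixed * 1e6:.2f} µs, per item ~ {per_item * 1e6:.2f} µs")
```

If the intercept dwarfs the slope, tiny batches can't amortize the per-call cost, which would match the n < 4 crossover seen above.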
@do-me Interesting! I get the following result on my system:
Not sure what causes the difference. In any case, thanks for testing this, and do let us know if you encounter any other performance quirks!
Sure, thanks a lot for being so responsive; it's a joy to work with this library this way :) I do have some other questions, but I will ask them in the discussions.
Is your sentence transformer distillation not as good as you hoped?
Post it here! We'd love to help you figure out your particular issue.
💘 Stéphan & Thomas