MinishLab / model2vec

Distill a Small Static Model from any Sentence Transformer
MIT License

🏃‍♂️ Performance issues megathread #56

Open stephantul opened 3 weeks ago

stephantul commented 3 weeks ago

Is your sentence transformer distillation not as good as you hoped?

Post it here! We'd love to help you figure out your particular issue.

💘 Stéphan & Thomas

do-me commented 1 week ago

Batch inferencing performance drop

Batch inferencing causes a big drop in performance. When running:

model.encode(["I love sun bathing, but I always use sun screen!","I love sun bathing, but I always use sun screen!"])

it takes much longer than running 2x:

model.encode(["I love sun bathing, but I always use sun screen!"])

Is there any bottleneck for batch inferencing?

This is my testing code:

# pip install model2vec

from model2vec import StaticModel

# Load a pretrained Model2Vec model, this is the static version of BAAI/bge-base-en-v1.5
model = StaticModel.from_pretrained("minishlab/M2V_base_output")

Running in Jupyter with Python 3.11 on an M3 Max

%%time
for i in range(100_000):
    model.encode(["I love sun bathing, but I always use sun screen!",
                  "I love sun bathing, but I always use sun screen!"])
# CPU times: user 1min 1s, sys: 4min 1s, total: 5min 3s  # So much more CPU time here
# Wall time: 21.6 s

%%time
for i in range(100_000):
    model.encode(["I love sun bathing, but I always use sun screen!"])
# CPU times: user 5.28 s, sys: 106 ms, total: 5.39 s
# Wall time: 5.39 s
stephantul commented 1 week ago

Hi! Thanks for the comment.

Interesting. Neither I nor @Pringled was able to reproduce the slowdown. I think this is related to the dispatch in the Rust tokenizer: it will spawn multiple workers, but because the strings are so short and there are only two of them, dispatching the workers only adds overhead.

Could you run this snippet? It requires the datasets library. The timings in the snippet were taken on a regular M3 MacBook.

from datasets import load_dataset
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
ds = load_dataset("wikimedia/wikipedia", data_files="20231101.en/train-00000-of-00041.parquet")["train"]
text = ds["text"][:5000]

%time _ = [model.encode(x) for x in text]
# CPU times: user 4.14 s, sys: 468 ms, total: 4.61 s
# Wall time: 4.63 s
%time _ = model.encode(text)
# CPU times: user 7.81 s, sys: 513 ms, total: 8.33 s
# Wall time: 935 ms

# If we limit to 2 docs
text = text[:2]
%time _ = [model.encode(x) for x in text]
# CPU times: user 3.55 ms, sys: 3.05 ms, total: 6.6 ms
# Wall time: 5.83 ms
%time _ = model.encode(text)
# CPU times: user 3.13 ms, sys: 427 μs, total: 3.56 ms
# Wall time: 1.8 ms

So, on my machine batching gives a monotonic speed-up, even with only two texts. If your issue still appears with 100, 1,000, or 10k texts, that would of course be very bad. Curious to see what comes out of it.
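
If the worker-dispatch idea holds, disabling the tokenizer's parallelism should make the tiny-batch case behave like the sequential one. Below is a minimal sketch to check that, assuming model2vec's Rust tokenizer is the Hugging Face tokenizers library, which honours the TOKENIZERS_PARALLELISM environment variable; this snippet is not part of the original benchmarks and the assumption may not hold for every setup.

# Sketch: test whether the tiny-batch slowdown comes from parallel tokenizer worker dispatch.
# Assumption: the Rust tokenizer is Hugging Face `tokenizers`, which reads
# TOKENIZERS_PARALLELISM; set it before the model (and its tokenizer) is loaded.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import time
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
sentences = ["I love sun bathing, but I always use sun screen!"] * 2

start = time.perf_counter()
for _ in range(1000):
    model.encode(sentences)  # batch of 2, now without parallel tokenization
print(f"batch of 2, parallelism off: {time.perf_counter() - start:.3f} s")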

do-me commented 1 week ago

Thanks, I ran the script:

CPU times: user 3.97 s, sys: 68.4 ms, total: 4.04 s
Wall time: 4.04 s
CPU times: user 13.3 s, sys: 2.19 s, total: 15.5 s
Wall time: 1.18 s
CPU times: user 22.5 ms, sys: 36.5 ms, total: 59 ms
Wall time: 3.78 ms
CPU times: user 6.5 ms, sys: 15.5 ms, total: 22 ms
Wall time: 1.43 ms

I tested, and the issue also persists with only 10 repetitions:

%%time
for i in range(10):
    model.encode(["I love sun bathing, but I always use sun screen!","I love sun bathing, but I always use sun screen!"])
# CPU times: user 14.9 ms, sys: 40.7 ms, total: 55.6 ms
# Wall time: 4.76 ms
%%time
for i in range(10):
    model.encode(["I love sun bathing, but I always use sun screen!"])
# CPU times: user 2.59 ms, sys: 1.23 ms, total: 3.82 ms
# Wall time: 2.47 ms

For my use case it's not a problem; I will just avoid batch processing for now. However, for new users who are used to batch inferencing always giving a massive speed boost, this might be counterintuitive.

stephantul commented 1 week ago

Hey @do-me ,

In your last example, the wall time of the second run is approximately half that of the first one. But you also process half as many examples. If you kept the number of items constant (i.e., you doubled the number of iterations of the second loop), you'd get approximately the same wall time.

To expand a bit more: the metric we're interested in is wall time, not CPU or user time. In fact, wall time is only lower because CPU time is higher: we use batch processing in the Rust-backed tokenizer, which spawns separate workers for the texts. So each second of wall time can translate to many CPU seconds and, since the work runs in user space, just as many user seconds.

So what you are seeing in the data above is exactly what one would expect given the parallelism. You can repeat the experiments using the %timeit magic to see what I mean.

n = 2
sentence = ["I love sun bathing, but I always use sun screen!"]
# Batch size N
%timeit model.encode(sentence * n)
# 138 us
# N times batch size 1
%timeit [model.encode(x) for x in (sentence * n)]
# 125 us

This leads to slightly higher processing times for batch size 2, which is indeed surprising, but it's only 138 vs 125 microseconds on my machine. The higher the batch size, the more savings you'd get, which you can simulate by making n higher. Also note that this effect completely disappears if you use longer texts:

n = 2
sentence = ["I love sun bathing, but I always use sun screen! " * 100]
# Batch size N
%timeit model.encode(sentence * n)
# 1.39 ms
# N times batch size 1
%timeit [model.encode(x) for x in (sentence * n)]
# 2.35 ms

So for a longer string, not batching already means a huge drop in performance, even at batch size 2.
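
As a side note on reading these numbers, here is a small sketch (not one of the benchmarks above) that measures wall time and process CPU time around a single call. time.process_time() sums CPU time over all threads of the process, so parallel tokenization shows up as a CPU/wall ratio well above 1:

import time
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
texts = ["I love sun bathing, but I always use sun screen!"] * 1000

wall_start, cpu_start = time.perf_counter(), time.process_time()
model.encode(texts)
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# With parallel tokenization, many CPU-seconds are spent per wall-second.
print(f"wall: {wall:.3f} s, CPU: {cpu:.3f} s, ratio: {cpu / wall:.1f}x")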

I hope this helps! Stéphan

do-me commented 1 week ago

> In your last example, the wall time of the second run is approximately half that of the first one. But you also process half as many examples. If you kept the number of items constant (i.e., you doubled the number of iterations of the second loop), you'd get approximately the same wall time.

Sorry for the confusing example, but no, it's not the same for n = 2. In fact, batch processing takes twice as long in this example:

%%time
for i in range(1000):
    model.encode(["I love sun bathing, but I always use sun screen!","I love sun bathing, but I always use sun screen!"])
# CPU times: user 830 ms, sys: 3.18 s, total: 4.01 s
# Wall time: 271 ms
%%time
for i in range(2000):
    model.encode(["I love sun bathing, but I always use sun screen!"])
# CPU times: user 150 ms, sys: 3.95 ms, total: 154 ms
# Wall time: 153 ms

However, this apparently only applies to very small batches with n < 4. Already at n = 4, the two are about the same:

n = 4
sentence = ["I love sun bathing, but I always use sun screen!"]
# Batch size N
%timeit model.encode(sentence * n)
# N times batch size 1
%timeit [model.encode(x) for x in (sentence * n)]

# 294 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 294 µs ± 5.81 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

For n = 100, it behaves as it should, with huge gains for batching:

# 1.04 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 7.77 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So my tl;dr is: if the batch size is below 4 and you work with short texts, don't batch (at least on my system).
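
If you still want a single call site, one possible workaround is a thin wrapper that only batches once the batch is large enough to amortize the dispatch overhead. This is just a sketch: encode_adaptive and the threshold of 4 are made up here, would need tuning per system, and it assumes model.encode returns a 2-D numpy array for list input.

import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")

def encode_adaptive(model, texts, min_batch_size=4):
    """Hypothetical helper: encode item by item below a threshold, batch above it."""
    if len(texts) < min_batch_size:
        # Per-item calls sidestep the tokenizer's worker-dispatch overhead on tiny inputs.
        return np.vstack([model.encode([t]) for t in texts])
    return model.encode(texts)

embeddings = encode_adaptive(model, ["I love sun bathing, but I always use sun screen!"] * 2)
print(embeddings.shape)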

Pringled commented 1 week ago

Heya @do-me ,

I looked into this a bit as well, since it's unexpected that batches of fewer than 4 texts would be faster without batching. I think the issue might be something else (e.g. the initialisation of the lists inside the loop). Could you run the following code on your system?

import timeit
import numpy as np

def measure_time(stmt, globals, loops, runs):
    times = timeit.repeat(stmt, globals=globals, repeat=runs, number=loops)
    mean_time = np.mean(times) / loops
    std_dev_time = np.std(times) / loops
    return mean_time, std_dev_time

# Number of runs and repetitions
runs = 7
loops = 1000
sentence = ["I love sun bathing, but I always use sun screen!"]

for n in range(1, 10):
    # Pre-create the input data outside the loops
    batch_sentence = sentence * n
    individual_sentences = [sentence[0] for _ in range(n)]

    print(f"n = {n}")
    # Batch size N
    mean_batch, std_batch = measure_time("model.encode(batch_sentence)", globals(), loops, runs)
    print(f"Batch size N: {mean_batch * 1e6:.2f} µs ± {std_batch * 1e6:.2f} µs per loop (mean ± std. dev. of {runs} runs, {loops} loops each)")

    # N times batch size 1
    mean_individual, std_individual = measure_time("[model.encode(x) for x in individual_sentences]", globals(), loops, runs)
    print(f"N times batch size 1: {mean_individual * 1e6:.2f} µs ± {std_individual * 1e6:.2f} µs per loop (mean ± std. dev. of {runs} runs, {loops} loops each)\n")

For me, this gives the following output:

n = 1
Batch size N: 65.53 µs ± 8.80 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 66.99 µs ± 10.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 2
Batch size N: 152.86 µs ± 18.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 130.02 µs ± 5.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 3
Batch size N: 165.88 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 185.85 µs ± 6.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 4
Batch size N: 157.36 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 253.65 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 5
Batch size N: 161.62 µs ± 4.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 317.99 µs ± 8.00 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 6
Batch size N: 199.23 µs ± 11.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 381.24 µs ± 7.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 7
Batch size N: 194.28 µs ± 11.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 475.11 µs ± 17.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 8
Batch size N: 203.56 µs ± 18.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 548.64 µs ± 17.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 9
Batch size N: 199.27 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 579.28 µs ± 13.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So only for n = 2 is there a slight improvement when not batching, which I think is related to what @stephantul mentioned above w.r.t. multiprocessing.

do-me commented 1 week ago

The effect persists; 4-5 is still the magic number on my system, don't ask me why :D It seems like there is some "base cost" in initializing the tokenizer (or similar) that is only outweighed for n > 4.

[Plot: batch size N vs. N times batch size 1 encoding time over n on do-me's system; see the code and results below]

Test code with viz:

```python
import timeit
import numpy as np
import matplotlib.pyplot as plt

def measure_time(stmt, globals, loops, runs):
    times = timeit.repeat(stmt, globals=globals, repeat=runs, number=loops)
    mean_time = np.mean(times) / loops
    std_dev_time = np.std(times) / loops
    return mean_time, std_dev_time

# Number of runs and repetitions
runs = 7
loops = 1000
sentence = ["I love sun bathing, but I always use sun screen!"]

# Store results for plotting
n_values = []
batch_means = []
individual_means = []
batch_stds = []
individual_stds = []

for n in range(1, 10):
    # Pre-create the input data outside the loops
    batch_sentence = sentence * n
    individual_sentences = [sentence[0] for _ in range(n)]
    n_values.append(n)

    # Batch size N
    mean_batch, std_batch = measure_time("model.encode(batch_sentence)", globals(), loops, runs)
    batch_means.append(mean_batch * 1e6)
    batch_stds.append(std_batch * 1e6)
    print(f"n = {n}")
    print(f"Batch size N: {mean_batch * 1e6:.2f} µs ± {std_batch * 1e6:.2f} µs per loop (mean ± std. dev. of {runs} runs, {loops} loops each)")

    # N times batch size 1
    mean_individual, std_individual = measure_time("[model.encode(x) for x in individual_sentences]", globals(), loops, runs)
    individual_means.append(mean_individual * 1e6)
    individual_stds.append(std_individual * 1e6)
    print(f"N times batch size 1: {mean_individual * 1e6:.2f} µs ± {std_individual * 1e6:.2f} µs per loop (mean ± std. dev. of {runs} runs, {loops} loops each)\n")

# Plotting
plt.figure(figsize=(10, 6))
plt.errorbar(n_values, batch_means, yerr=batch_stds, label="Batch size N", marker='o', capsize=5)
plt.errorbar(n_values, individual_means, yerr=individual_stds, label="N times batch size 1", marker='x', capsize=5)
plt.xlabel("N (Number of sentences)")
plt.ylabel("Time (µs)")
plt.title("Performance Comparison: Batch vs. Individual")
plt.legend()
plt.grid(True)
plt.show()
```
Results:

```
n = 1
Batch size N: 73.49 µs ± 3.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 72.22 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 2
Batch size N: 248.80 µs ± 15.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 144.31 µs ± 3.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 3
Batch size N: 272.46 µs ± 14.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 214.95 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 4
Batch size N: 291.30 µs ± 15.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 288.21 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 5
Batch size N: 306.44 µs ± 18.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 370.85 µs ± 8.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 6
Batch size N: 335.51 µs ± 18.79 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 445.07 µs ± 4.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 7
Batch size N: 327.37 µs ± 31.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 510.61 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 8
Batch size N: 341.84 µs ± 21.30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 602.72 µs ± 8.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

n = 9
Batch size N: 364.33 µs ± 22.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N times batch size 1: 679.37 µs ± 8.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
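
To put a number on that suspected base cost, one option (a sketch, not part of the benchmarks above) is to fit a line to the batched timings as a function of n: the intercept approximates the fixed per-call dispatch overhead and the slope the marginal cost per sentence.

import timeit
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
sentence = "I love sun bathing, but I always use sun screen!"

ns = list(range(1, 10))
# Best-of-5 mean seconds per call for each batch size n
means = [
    min(timeit.repeat(lambda: model.encode([sentence] * n), repeat=5, number=200)) / 200
    for n in ns
]

# Linear fit: time_per_call ≈ slope * n + intercept
slope, intercept = np.polyfit(ns, means, 1)
print(f"base cost ≈ {intercept * 1e6:.1f} µs per call, marginal cost ≈ {slope * 1e6:.1f} µs per sentence")
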
Pringled commented 1 week ago

@do-me Interesting! I get the following result on my system:

[Plot: batch_vs_individual, the same batch vs. individual timing comparison on Pringled's system]

Not sure what causes the difference. In any case, thanks for testing this, and do let us know if you encounter any other performance quirks!

do-me commented 1 week ago

Sure! Thanks a lot for being so responsive; it makes working with this library a joy :) I do have some other questions, but I will ask them in the discussions.