UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

speed-up recommendations of cross-encoder on a CPU #2482

Open krumeto opened 6 months ago

krumeto commented 6 months ago

Hey team,

I am looking for ways to speed up a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) on a CPU. For a variety of reasons, GPUs are off the table for now.

We have the following setup:

The model gets a short query and a list of 60 - 80 texts, typically longer than the 512-token maximum (so they get truncated).

Do you have any recommendations for speed-up?

Example ideas that we had:

  1. Get bigger CPUs and increase the batch_size from the default of 32 to our expected number of texts (for example, 80) - see the sketch after this list
  2. Potentially try ONNX - for example, based on https://www.philschmid.de/optimize-sentence-transformers or this model - https://huggingface.co/metarank/ce-msmarco-MiniLM-L6-v2
  3. Perhaps something like multi-cpu-reranking? (https://sbert.net/examples/applications/computing-embeddings/README.html#multi-process-multi-gpu-encoding)
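
For reference, a minimal sketch of idea 1, assuming the sentence-transformers CrossEncoder API and placeholder data:

from sentence_transformers import CrossEncoder

# Model from above; max_length defaults to the model's 512-token limit
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "example query"
documents = [f"candidate document {i}" for i in range(80)]  # ~60-80 texts per request

# Score every (query, document) pair in one batch instead of the default batch_size=32
scores = model.predict([(query, doc) for doc in documents], batch_size=80)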

Any ideas would be more than welcome. Thank you in advance!

tomaarsen commented 6 months ago

Hello!

ONNX might be worth a shot:

CrossEncoder via transformers


from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import time

model = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

# 80 texts:
texts = [
    ["How many people live in Berlin?", "How many people live in Berlin?"],
    [
        "Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
        "New York City is famous for the Metropolitan Museum of Art.",
    ],
] * 40

warmup_features = tokenizer(texts[:10], padding=True, return_tensors="pt")
features = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    scores = model(**warmup_features).logits

    start_t = time.time()
    scores = model(**features).logits
    print(
        f"Time: {len(texts) / (time.time() - start_t):.4f} sentences per second"
    )
Time: 601.4126 sentences per second

ONNX:

Export via:

optimum-cli export onnx -m cross-encoder/ms-marco-MiniLM-L-6-v2 ce-ms-marco-MiniLM-L-6-v2


from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
import torch
import time

model = ORTModelForSequenceClassification.from_pretrained("ce-ms-marco-MiniLM-L-6-v2")
tokenizer = AutoTokenizer.from_pretrained("ce-ms-marco-MiniLM-L-6-v2")

# 80 texts:
texts = [
    ["How many people live in Berlin?", "How many people live in Berlin?"],
    [
        "Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
        "New York City is famous for the Metropolitan Museum of Art.",
    ],
] * 40

warmup_features = tokenizer(texts[:10], padding=True, return_tensors="pt")
features = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    scores = model(**warmup_features).logits

    start_t = time.time()
    scores = model(**features).logits
    print(
        f"Time: {len(texts) / (time.time() - start_t):.4f} sentences per second"
    )
Time: 905.0249 sentences per second

or

optimum-cli export onnx -m cross-encoder/ms-marco-MiniLM-L-6-v2 ce-ms-marco-MiniLM-L-6-v2 --optimize O3

with the same script:

Time: 981.4480 sentences per second
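
If you want to squeeze more out of the CPU, dynamic INT8 quantization of the exported ONNX model (as in the philschmid post you linked) might also help. A rough sketch, assuming the optimum.onnxruntime quantization API and the export directory from above; I haven't benchmarked it here:

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic quantization config for AVX512-VNNI CPUs (pick the variant matching your hardware)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Load the ONNX model exported above and write model_quantized.onnx to the save_dir
quantizer = ORTQuantizer.from_pretrained("ce-ms-marco-MiniLM-L-6-v2")
quantizer.quantize(save_dir="ce-ms-marco-MiniLM-L-6-v2-quantized", quantization_config=qconfig)

The quantized model can then be loaded in the same script as above via ORTModelForSequenceClassification.from_pretrained("ce-ms-marco-MiniLM-L-6-v2-quantized", file_name="model_quantized.onnx").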

Beyond that, multi-CPU reranking should definitely work as well, but you'd have to implement it manually; a rough sketch of what that could look like is below. However, I think it might not fit your use case well, as it's primarily meant for very large datasets (setting up the process pools is slow).
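
Purely as an illustration (there is no built-in multi-process API for CrossEncoder), something along these lines, with each worker process lazily loading its own copy of the model:

from multiprocessing import Pool

from sentence_transformers import CrossEncoder

MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
_model = None  # one CrossEncoder instance per worker process

def _score_chunk(pairs):
    # Lazily load the model the first time this worker receives work
    global _model
    if _model is None:
        _model = CrossEncoder(MODEL_NAME)
    return _model.predict(pairs).tolist()

def rerank(query, documents, num_workers=4):
    pairs = [(query, doc) for doc in documents]
    chunk_size = max(1, len(pairs) // num_workers)
    chunks = [pairs[i:i + chunk_size] for i in range(0, len(pairs), chunk_size)]
    with Pool(num_workers) as pool:
        chunk_scores = pool.map(_score_chunk, chunks)
    return [score for chunk in chunk_scores for score in chunk]

if __name__ == "__main__":
    docs = ["Berlin has a population of about 3.5 million."] * 8
    print(rerank("How many people live in Berlin?", docs))

For only 60-80 texts per query, the pool startup and per-worker model loading will likely dominate, so this is probably only worthwhile if you keep the pool alive between requests.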

Increasing the batch size might indeed also help, but I'm not sure how well that works on CPU (it certainly does on GPU).

Another consideration is using the Intel Neural Compressor, e.g. via optimum-intel, but I'm not very familiar with it & I'm not sure whether it would work on the CPUs you mentioned.

krumeto commented 6 months ago

@tomaarsen We managed to speed up the CrossEncoder on our CPUs significantly. Reporting below, in case similar questions appear in the future. Feel free to close this one.

  1. Doubled the Fargate CPU units from 4k to 8k. This gave roughly a 2x speedup by itself; adding more CPU units hits diminishing returns in our setup.
  2. Lowered the cross-encoder max token length from 512 to 352 (a minimal sketch of this is shown after the list). Our average tokenised document is between 350 and 400 tokens. Even for larger documents, we observed almost identical scores but a significant speed-up. The takeaway for us is to aim for the 75th percentile of document length (rather than the max/95th percentile) when choosing the default max token length, and/or to compare scores across max token lengths to understand the speed/quality trade-off. The ranking of the top 7 documents barely changed and the scores were very similar at the reduced max token length.
  3. Increased the batch size to 64 (our usual batch being around 60 documents) on Fargate, but lowered it to 16 for local experiments. Lower batch sizes tended to be faster on local CPUs and slower on our Fargate setup.
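
For reference, a minimal sketch of how points 2 and 3 map onto the CrossEncoder API (model name and values as described above; not our full pipeline):

from sentence_transformers import CrossEncoder

# Truncate inputs at 352 tokens instead of the model's 512-token default
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=352)

# pairs = [(query, doc) for doc in documents]  # ~60 documents per request
# scores = model.predict(pairs, batch_size=64)  # 64 on Fargate, 16 locally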

The biggest portion of the work was done by @pgergov.

tomaarsen commented 6 months ago

Awesome! Thanks for sharing, and great job @pgergov. Out of curiosity, have you also tried BAAI/bge-reranker-base, BAAI/bge-reranker-large and/or the very new mixedbread-ai/mxbai-rerank-xsmall-v1, mixedbread-ai/mxbai-rerank-base-v1 and/or mixedbread-ai/mxbai-rerank-large-v1 models?

They might improve your reranking performance, and the mixedbread xsmall one might only be slightly slower than L6.
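
If it's useful, a minimal sketch of trying one of them, assuming it loads as a standard sequence-classification cross-encoder (the example pair is placeholder data):

from sentence_transformers import CrossEncoder

# Swap in one of the suggested rerankers; the other model names above work the same way
model = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")

scores = model.predict([
    ("How many people live in Berlin?",
     "Berlin has a population of 3,520,031 registered inhabitants."),
])
print(scores)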

krumeto commented 6 months ago

We tried bge-reranker-base and it was an improvement in quality, but prohibitively slow on a CPU. I tried mxbai-rerank-xsmall-v1 this morning - it is roughly two times slower, so not on the table for now, but it might be a very nice quality/speed compromise. Once we have a GPU-enabled setup, all of those will be on the table.

snayan06 commented 4 months ago

Hi, I was experimenting with the cross-encoder from Hugging Face, specifically the one at https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1. When I ran it on a CPU in Google Colab, the latency was around 400 milliseconds (measured with %%timeit). However, when I tried the same setup in my dev environment, a Linux server with 4 cores and 16 GB of RAM, the performance wasn't even close.

I'm curious how others benchmark and host this type of model. I wonder if I'm missing something - the script I'm using is pretty straightforward, and even a single request at a time takes considerably longer there.

So, how do people typically host this kind of model on a CPU, and am I missing something? (If you can point me to some resources, that would be highly appreciated.)

snayan06 commented 4 months ago

You can check my script here: https://gist.github.com/snayan06/0e1ae7e9999c78a8b4533a7ed570f69d. These are some of the results we noted down, but I'm thinking there should be a better way to do this 🙈 😓

(Screenshot of the recorded results, 2024-04-30.)