UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Inference speed difference when using sentence-transformers (python) and candle (rust) #2897

Open AbhishekBose opened 3 weeks ago

AbhishekBose commented 3 weeks ago

Here is my candle implementation (taken from the candle examples):

```rust
pub fn encode(&self, prompt: &str) -> Result<(Tensor, Tensor)> {
    // Tokenize the prompt and collect the token ids.
    let tokens = self
        .tokenizer
        .encode(prompt, true)
        .map_err(E::msg)?
        .get_ids()
        .to_vec();
    // Build a [1, seq_len] tensor on the target device.
    let token_ids = Tensor::new(&tokens[..], &self.device)?.unsqueeze(0)?;
    // BERT-style models expect token type ids; all zeros for a single segment.
    let token_type_ids = token_ids.zeros_like()?;
    // Forward pass through the model.
    let embeddings = self.model.forward(&token_ids, &token_type_ids)?;
    Ok((embeddings, token_ids))
}
```
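A minimal harness along these lines can produce per-call timings like the ones below. It is only a sketch (the `bench` helper and the warm-up/iteration counts are illustrative, not from the candle examples), but discarding a few warm-up calls keeps one-off costs such as weight loading and first-run allocations out of the measured numbers:

```rust
use std::time::Instant;

/// Sketch: time repeated calls to any fallible encode closure, discarding
/// a few warm-up iterations so one-off costs (weight loading, first-run
/// allocations) do not inflate the per-call numbers.
fn bench<T, E>(
    mut encode: impl FnMut() -> Result<T, E>,
    warmup: usize,
    iters: usize,
) -> Result<(), E> {
    for _ in 0..warmup {
        encode()?; // warm-up, not timed
    }
    for _ in 0..iters {
        let start = Instant::now();
        encode()?;
        println!("Time taken for encoding: {:?}", start.elapsed());
    }
    Ok(())
}

// Usage (assuming `encoder` is the struct owning the tokenizer and model above):
// bench(|| encoder.encode("some query"), 3, 10)?;
```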

and here is the Python implementation:

```python
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode([request.query])
```

Comparing just the encoding times, this is what I get.

Encoding time for Rust:

```
Time taken for encoding: 57.054333ms
Time taken for encoding: 59.913916ms
Time taken for encoding: 55.118625ms
Time taken for encoding: 51.580917ms
Time taken for encoding: 60.823625ms
Time taken for encoding: 56.318333ms
Time taken for encoding: 52.357875ms
Time taken for encoding: 82.0645ms
Time taken for encoding: 52.349709ms
Time taken for encoding: 63.768209ms
Time taken for encoding: 55.508666ms
```

Encoding time for Python:

```
Time taken for encoding: 33.95 ms
Time taken for encoding: 124.68 ms
Time taken for encoding: 54.5 ms
Time taken for encoding: 30.46 ms
Time taken for encoding: 20.73 ms
Time taken for encoding: 26.07 ms
Time taken for encoding: 37.49 ms
Time taken for encoding: 24.42 ms
Time taken for encoding: 36.08 ms
Time taken for encoding: 24.55 ms
Time taken for encoding: 36.13 ms
Time taken for encoding: 29.97 ms
Time taken for encoding: 35.69 ms
Time taken for encoding: 26.8 ms
Time taken for encoding: 31.32 ms
Time taken for encoding: 30.12 ms
Time taken for encoding: 32.37 ms
Time taken for encoding: 34.27 ms
Time taken for encoding: 31.85 ms
Time taken for encoding: 35.78 ms
Time taken for encoding: 44.09 ms
Time taken for encoding: 19.15 ms
Time taken for encoding: 23.83 ms
Time taken for encoding: 33.09 ms
Time taken for encoding: 31.65 ms
```

What could be the reason for the difference in inference time between the two? I was under the impression that sentence-transformers uses Rust bindings.

The experiment was run on an M1 MacBook Air.

tomaarsen commented 3 weeks ago

Hello!

> I was under the impression that sentence-transformers uses Rust bindings.

Sentence Transformers relies on torch to run its models, not on Rust. That would cause the discrepancy.

Having said that, on GPUs the Candle-based runtime should be quicker; see https://github.com/huggingface/text-embeddings-inference, which uses Candle. Perhaps that project has quicker runtimes because of other optimizations (like Flash Attention), or perhaps because it benchmarks larger models like https://huggingface.co/BAAI/bge-base-en-v1.5?
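Since the numbers above were collected on an M1, it is also worth checking which device the candle model was created on. A rough sketch (assuming `candle_core` is built with the `metal` feature; `select_device` is an illustrative helper, not a candle API):

```rust
use candle_core::Device;

// Rough sketch: prefer the Metal backend on Apple Silicon and fall back
// to CPU. Without the `metal` feature, `Device::new_metal` returns an
// error and we stay on CPU. If the model in the snippet above was built
// with `Device::Cpu`, the comparison measures CPU-only candle against
// whatever device torch selected.
fn select_device() -> Device {
    Device::new_metal(0).unwrap_or(Device::Cpu)
}
```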

AbhishekBose commented 3 weeks ago

Flash Attention is a possibility; I'll have to check that part. Although I did run the test using the same model, 'sentence-transformers/all-MiniLM-L6-v2'.