Fast Sentence Transformers

This repository contains code to run feature extractors faster using tools like quantization, optimization and ONNX. Just run your model much faster, while using less memory. There is not much to it!

Philipp Schmid: "We successfully quantized our vanilla Transformers model with Hugging Face and managed to accelerate our model latency from 25.6ms to 12.3ms or 2.09x while keeping 100% of the accuracy on the stsb dataset. But I have to say that this isn't a plug and play process you can transfer to any Transformers model, task or dataset."

Install

pip install fast-sentence-transformers

Or, for GPU support:

pip install fast-sentence-transformers[gpu]

Quickstart


from fast_sentence_transformers import FastSentenceTransformer as SentenceTransformer

# use any sentence-transformer
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

encoder.encode("Hello hello, hey, hello hello")
encoder.encode(["Life is too short to eat bad food!"] * 2)
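A common next step is comparing two embeddings. The sketch below is a minimal, dependency-free cosine similarity. It assumes that `encode` returns one flat vector of floats per sentence, as sentence-transformers models usually do; the README does not state this explicitly. The helper name `cosine_similarity` and the stand-in vectors are illustrative and not part of this package.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in vectors so the sketch runs without downloading a model.
# With a real encoder you would pass two encode(...) results instead.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u
w = [-3.0, 0.0, 1.0]  # orthogonal to u

sim_parallel = cosine_similarity(u, v)
sim_orthogonal = cosine_similarity(u, w)
```

Values close to 1 indicate vectors pointing in the same direction, and values close to 0 indicate unrelated directions.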

Benchmark

Non-exact, indicative benchmark for speed and memory usage with a smaller and a larger sentence-transformers model. Values are relative to the default setup.

model	Type	default	ONNX	ONNX+quantized	ONNX+GPU
paraphrase-albert-small-v2	memory	1x	1x	1x	1x
	speed	1x	2x	5x	20x
paraphrase-multilingual-mpnet-base-v2	memory	1x	1x	4x	4x
	speed	1x	2x	5x	20x
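To get indicative speed numbers for your own hardware, a small timing harness along these lines may help. This is only a sketch and not the script behind the table above: `median_latency` and `dummy_encode` are hypothetical names, and the dummy encoder stands in for a real `encode` method so the example runs without any model.

```python
import statistics
import time
from typing import Callable, List, Sequence


def median_latency(
    encode_fn: Callable[[Sequence[str]], object],
    sentences: Sequence[str],
    repeats: int = 5,
    warmup: int = 1,
) -> float:
    """Median wall-clock seconds for one encode_fn(sentences) call."""
    # Warm-up calls are not timed, so one-off setup costs do not skew results.
    for _ in range(warmup):
        encode_fn(sentences)
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        encode_fn(sentences)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def dummy_encode(sentences: Sequence[str]) -> List[List[float]]:
    # Hypothetical stand-in for a real encoder: one tiny "embedding" per sentence.
    return [[float(len(s))] for s in sentences]


sentences = ["Life is too short to eat bad food!"] * 8
latency = median_latency(dummy_encode, sentences)
```

To compare two setups, pass each encoder's `encode` method as `encode_fn` with the same sentences, then divide the baseline latency by the optimized latency to get a speed-up factor.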

Shout-Out

This package heavily leans on https://www.philschmid.de/optimize-sentence-transformers.

davidberenstein1957 / fast-sentence-transformers
