embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

How does MTEB handle truncation in embedding models? #377

Closed rasdani closed 4 days ago

rasdani commented 5 months ago

Embedding models have differing context lengths. intfloat/multilingual-e5-base, for example, has an input limit of 512 tokens; anything beyond that is simply truncated.
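
For reference, the limit can be checked directly through sentence-transformers (a quick sketch; I am assuming the loaded configuration reports 512 for this model):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")
# Maximum number of tokens encoded per input; longer texts are silently cut off.
print(model.max_seq_length)  # expected: 512
```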

Texts in MTEB's datasets also vary in length. Is MTEB in any way aware of the embedding model's token limit?

I am working on a dataset, was considering chunking it, and am wondering how to do this properly.

KennethEnevoldsen commented 5 months ago

@rasdani I believe that is up to the model to handle.

The benchmark simply hands the full-length texts (list[str]) to the model, and the model (typically implemented using SentenceTransformers) then truncates them. However, a model can implement the encode method in any way it likes, including a custom strategy for dealing with documents that are too long.
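
For illustration, something along these lines would work (a minimal sketch, assuming a SentenceTransformers backbone; the ChunkingModel wrapper, the word-based splitting, and the mean-pooling of chunk embeddings are just example choices, not anything MTEB prescribes):

```python
# Minimal sketch of a model wrapper whose encode() chunks long documents
# instead of truncating them. The chunking and pooling strategy here is
# purely illustrative; MTEB does not do this for you.
import numpy as np
from sentence_transformers import SentenceTransformer


class ChunkingModel:
    def __init__(self, model_name: str, max_words: int = 350):
        self.model = SentenceTransformer(model_name)
        # Rough word budget per chunk so tokenized chunks stay under the
        # model's 512-token limit (a heuristic, not an exact bound).
        self.max_words = max_words

    def encode(self, sentences: list[str], batch_size: int = 32, **kwargs) -> np.ndarray:
        doc_embeddings = []
        for text in sentences:
            words = text.split()
            # Naive word-based chunking; a tokenizer-based split would be more precise.
            chunks = [
                " ".join(words[i : i + self.max_words])
                for i in range(0, max(len(words), 1), self.max_words)
            ]
            chunk_embs = self.model.encode(chunks, batch_size=batch_size)
            # Represent the document as the mean of its chunk embeddings.
            doc_embeddings.append(np.asarray(chunk_embs).mean(axis=0))
        return np.vstack(doc_embeddings)
```

Whether mean-pooling chunk embeddings is actually a good strategy depends on the task; the point is simply that encode receives the full texts and can handle long inputs however it wants.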

I might be missing some implementation details, but I believe @Muennighoff can fill in the blanks here.

Muennighoff commented 5 months ago

My understanding is the same as that of @KennethEnevoldsen ! Great explanation!

KennethEnevoldsen commented 4 days ago

This issue seems to have been answered - will close it