Closed rasdani closed 4 days ago
@rasdani I believe that is up to the model to handle.
The benchmark simply hands the full-length texts (list[str]) to the model, and the model (typically implemented using SentenceTransformers) truncates them. However, the model can implement the encode
method in any way it likes, with its own custom handling of documents that are too long.
I might be missing some implementation details, but I believe @Muennighoff can fill in the blanks here.
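As a hedged sketch of what such a custom encode method could look like (not MTEB's or SentenceTransformers' actual code): a wrapper that chunks over-long texts, encodes each chunk, and mean-pools the chunk embeddings so it still returns one vector per input. The whitespace "tokenizer" and the toy base encoder below are stand-ins for illustration only.

```python
import numpy as np


def chunk_tokens(tokens, max_tokens=512):
    """Split a token list into consecutive windows of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


class ChunkingEncoder:
    """Wraps any callable base_encode(list[str]) -> np.ndarray.

    Long texts are split into chunks; chunk embeddings are mean-pooled,
    so encode() still returns one vector per input text, which is the
    interface the benchmark expects.
    """

    def __init__(self, base_encode, max_tokens=512):
        self.base_encode = base_encode
        self.max_tokens = max_tokens

    def encode(self, sentences, **kwargs):
        out = []
        for text in sentences:
            tokens = text.split()  # toy whitespace "tokenizer", illustration only
            chunks = [" ".join(c) for c in chunk_tokens(tokens, self.max_tokens)] or [""]
            embs = self.base_encode(chunks)
            out.append(np.mean(embs, axis=0))  # mean-pool chunk embeddings
        return np.vstack(out)


# Toy base encoder standing in for a real SentenceTransformer
def toy_encode(texts):
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=float)


enc = ChunkingEncoder(toy_encode, max_tokens=4)
vecs = enc.encode(["a b c d e f", "short"])
print(vecs.shape)  # (2, 2)
```

With a real model you would pass the model's own encode method (and its real tokenizer limit) instead of the toy stand-ins.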
My understanding is the same as @KennethEnevoldsen's! Great explanation!
This issue seems to have been answered - will close it.
Embedding models have differing context lengths.
intfloat/multilingual-e5-base
, for example, has an input limit of 512 tokens; after that it just truncates the text. Texts in MTEB's datasets vary in length, too. Is MTEB in any way aware of the embedding model's token limit?
I am working on a dataset, was considering chunking it, and was wondering how to do it properly.
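The truncation behavior described above can be illustrated with a minimal sketch. This is not the model's real tokenizer (which operates on subword tokens, not whitespace); it only shows the effect of a 512-token limit.

```python
def truncate_to_limit(text: str, max_tokens: int = 512) -> str:
    """Keep only the first max_tokens tokens, mimicking what a model
    with a fixed input limit does internally before encoding."""
    tokens = text.split()  # stand-in for a real subword tokenizer
    return " ".join(tokens[:max_tokens])


long_text = " ".join(f"tok{i}" for i in range(600))
truncated = truncate_to_limit(long_text)
print(len(truncated.split()))  # 512
```

Everything past the limit is silently dropped, which is why chunking (or some other strategy inside encode) matters for long documents.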