UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

State-of-the-art pretrained model for sentence similarity/clustering? #2600

Open raphael-milliere opened 5 months ago

raphael-milliere commented 5 months ago

I've been looking for up-to-date information about how various pre-trained models fare on sentence similarity and clustering tasks (e.g., with BERTopic), rather than semantic search.

According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for sentence similarity and/or clustering?

Looking at the MTEB leaderboard, mxbai-embed-large-v1 currently appears to be the leading open-weights model. Should I expect it to outperform all-mpnet-base-v2 or sentence-t5-xxl, given that I'm not constrained by compute?

Thanks in advance!

tomaarsen commented 5 months ago

Hello!

Although the original sentence-transformers models like all-mpnet-base-v2 hold up quite well, recent community models like mxbai-embed-large-v1 should indeed outperform them. You can check the Sentence Similarity and Clustering tasks on the MTEB leaderboard (and probably filter out models with more than 1B parameters), and you'll get a good idea of what should work well.
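Swapping candidates in is a one-line change, so comparing a leaderboard model against all-mpnet-base-v2 on a handful of sentences is quick. A minimal sketch (the example sentences are purely illustrative, and I'm assuming the mixedbread-ai/mxbai-embed-large-v1 Hub id for the second model):

```python
from sentence_transformers import SentenceTransformer, util

# Toy sentences purely for illustration; use pairs from your own domain.
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# Any model id from the Hugging Face Hub works here; swap in other
# candidates from the MTEB leaderboard as you like.
for model_name in ["all-mpnet-base-v2", "mixedbread-ai/mxbai-embed-large-v1"]:
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    # Pairwise cosine similarity matrix between all sentences
    print(model_name)
    print(util.cos_sim(embeddings, embeddings))
```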

You're on the right track :)

Oh, one last note: if you have some evals/tests ready for BERTopic, then you can always experiment with a few different models; see the sketch below. Most of them are fairly small and efficient, so it's quite simple to try several and get a feel for them. No leaderboard will ever beat running models on your own data.
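For instance, a rough sketch of that kind of experiment (assuming recent bertopic and sentence-transformers versions; the 20 Newsgroups corpus is just a stand-in for your own documents):

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

# Stand-in corpus; swap in your own documents.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data

# Load a candidate embedding model; changing this one line is the whole experiment.
embedding_model = SentenceTransformer("all-mpnet-base-v2")

# Precomputing embeddings keeps the rest of the BERTopic pipeline identical
# across models, so any quality difference comes from the embeddings alone.
embeddings = embedding_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs, embeddings)

print(topic_model.get_topic_info().head())
```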