UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

State-of-the-art pretrained model for sentence similarity/clustering? #2600

Open raphael-milliere opened 5 months ago

raphael-milliere commented 5 months ago

I've been looking for up-to-date information about how various pre-trained models fare on sentence similarity and clustering tasks (e.g., with BERTopic), rather than semantic search.

According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for sentence similarity and/or clustering?

Looking at the MTEB leaderboard, mxbai-embed-large-v1 currently appears to be the leading open-weights model. Should I expect it to outperform all-mpnet-base-v2 or sentence-t5-xxl, given that I'm not constrained by compute?

Thanks in advance!

tomaarsen commented 5 months ago

Hello!

Although the original sentence-transformers models like all-mpnet-base-v2 hold up quite well, recent community models like mxbai-embed-large-v1 should indeed outperform them. You can check the Sentence Similarity and Clustering tasks on the MTEB leaderboard (and probably filter out models with more than 1B parameters), and you'll get a good idea of what should work well.
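Swapping candidates in is a one-line change, so comparing a leaderboard model against all-mpnet-base-v2 on a handful of sentences is quick. A minimal sketch (the example sentences are purely illustrative, and I'm assuming the mixedbread-ai/mxbai-embed-large-v1 Hub id for the second model):

```python
from sentence_transformers import SentenceTransformer, util

# Toy sentences purely for illustration; use pairs from your own domain.
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# Any model id from the Hugging Face Hub works here; swap in other
# candidates from the MTEB leaderboard as you like.
for model_name in ["all-mpnet-base-v2", "mixedbread-ai/mxbai-embed-large-v1"]:
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    # Pairwise cosine similarity matrix between all sentences
    print(model_name)
    print(util.cos_sim(embeddings, embeddings))
```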

You're on the right track :)

Oh, one last note: if you have some evals/tests ready for BERTopic, then you can always experiment with a few different models; see the sketch below. Most of them are fairly small and efficient, so it's quite simple to try several and get a feel for them. No leaderboard will ever beat running models on your own data.
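For instance, a rough sketch of that kind of experiment (assuming recent bertopic and sentence-transformers versions; the 20 Newsgroups corpus is just a stand-in for your own documents):

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

# Stand-in corpus; swap in your own documents.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data

# Load a candidate embedding model; changing this one line is the whole experiment.
embedding_model = SentenceTransformer("all-mpnet-base-v2")

# Precomputing embeddings keeps the rest of the BERTopic pipeline identical
# across models, so any quality difference comes from the embeddings alone.
embeddings = embedding_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs, embeddings)

print(topic_model.get_topic_info().head())
```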