raphael-milliere opened 7 months ago
Hello!
Although the original sentence-transformers models like `all-mpnet-base-v2` hold up quite well, recent community models like `mxbai-embed-large-v1` should indeed outperform them. You can check the Sentence Similarity and Clustering tasks on the MTEB leaderboard (probably filtering out models above 1B parameters), and you'll get a good idea of what should work well.
You're on the right track :)
Oh, one last note: if you have some evals/tests ready for BERTopic, then you can always experiment with a few different models. They're mostly fairly small and efficient, so it should be quite simple to try out a few to get a feel for them. No leaderboard will ever beat running models on your own data.
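To make that last point concrete, here is a minimal sketch of such an experiment: fit BERTopic once per candidate embedding model so the resulting topic models can be compared against your own evals. The `docs` list and the helper name `fit_per_model` are placeholders for illustration; `embedding_model=` is BERTopic's standard hook for plugging in any SentenceTransformer, and the model names are just the ones discussed in this thread, not recommendations.

```python
# Sketch: fit one BERTopic instance per candidate embedding model, so each
# can be scored on your own data. `docs` is assumed to be your own list of
# strings; `fit_per_model` is a hypothetical helper, not a BERTopic API.

def fit_per_model(docs, model_names):
    """Return {model_name: (fitted BERTopic model, topic assignments)}."""
    # Imports kept inside the function so the rest of the sketch can be
    # read (or reused) without bertopic/sentence-transformers installed.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    results = {}
    for name in model_names:
        embedder = SentenceTransformer(name)
        # Swapping the embedding model is the only change between runs;
        # BERTopic's clustering pipeline stays identical.
        topic_model = BERTopic(embedding_model=embedder)
        topics, _probs = topic_model.fit_transform(docs)
        results[name] = (topic_model, topics)
    return results


if __name__ == "__main__":
    candidates = [
        "sentence-transformers/all-mpnet-base-v2",
        "mixedbread-ai/mxbai-embed-large-v1",
    ]
    # Replace with your own corpus before running.
    docs = ["..."]
    results = fit_per_model(docs, candidates)
```

From there you can run whatever evals you already have for BERTopic over each entry in `results` and pick the winner on your data rather than on a leaderboard.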
I've been looking for up-to-date information about how various pre-trained models fare on sentence similarity and clustering tasks (e.g. with BERTopic), rather than semantic search.

According to the official pre-trained model evaluations, `all-mpnet-base-v2` is best overall, while `sentence-t5-xxl` is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for sentence similarity and/or clustering?

Looking at the MTEB leaderboard, `mxbai-embed-large-v1` appears to be the leading open-weights model currently. Should I expect this model to be superior to `all-mpnet-base-v2` or `sentence-t5-xxl`, given that I'm not constrained by compute?

Thanks in advance!