Open raphael-milliere opened 7 months ago
I would indeed advise looking at the MTEB leaderboard, and specifically at the clustering metric, since that is what BERTopic mostly relies on. In my experience, clusters are formed a bit better when using a model that scores higher on that leaderboard.
However, do note that small differences in clusters might not affect the topic representations much if you have a relatively large dataset. You might see differences in smaller clusters, but it is unlikely to affect the larger clusters that already have good representations.
@raphael-milliere did you find anything?
I've been looking for up-to-date information on how various pre-trained models compare for clustering and topic modeling with BERTopic, rather than for semantic search, which is all the rage these days with RAG pipelines.
According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old; surely there are better pre-trained models available for similarity and clustering by now?
Looking at the MTEB leaderboard, mxbai-embed-large-v1 appears to be the leading open-weights model currently. Should I expect it to outperform all-mpnet-base-v2 or sentence-t5-xxl for BERTopic? I've done some informal tests, but I'm not convinced it produces better topics.