MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.19k stars, 765 forks

Best-performing embedding models? #1931

Open raphael-milliere opened 7 months ago

raphael-milliere commented 7 months ago

I've been looking for up-to-date information about how various pre-trained models compare for clustering and topic modeling with BERTopic, rather than for semantic search, which is all the rage these days with RAG pipelines.

According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for similarity/clustering?

Looking at the MTEB leaderboard, mxbai-embed-large-v1 currently appears to be the leading open-weights model. Should I expect this model to be superior to all-mpnet-base-v2 or sentence-t5-xxl for BERTopic? I've done some informal tests, but I'm not convinced it results in better topics.

MaartenGr commented 7 months ago

I would indeed advise looking at the MTEB leaderboard, and specifically at the clustering scores, since clustering is what BERTopic relies on most. In my experience, clusters form somewhat better with a model that scores higher on the leaderboard.
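
For context on what a clustering score measures: as far as I know, the MTEB clustering tasks report V-measure, the harmonic mean of homogeneity and completeness between predicted clusters and reference labels. Below is a minimal, standard-library sketch of that metric. The function names are illustrative and are not part of MTEB or BERTopic; in practice `sklearn.metrics.v_measure_score` computes the same quantity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (both in [0, 1])."""
    h_true = entropy(true_labels)
    h_pred = entropy(cluster_labels)
    # Homogeneity: each cluster contains members of a single class.
    homogeneity = (
        1.0
        if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    )
    # Completeness: all members of a class land in the same cluster.
    completeness = (
        1.0
        if h_pred == 0
        else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

The score does not depend on how clusters are numbered, so `v_measure([0, 0, 1, 1], [1, 1, 0, 0])` counts as a perfect match. Note that it needs reference labels, which most topic modeling datasets do not have.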

However, note that small differences in the clusters might not affect the topic representations much if you have a relatively large dataset. You might see differences in the smaller clusters, but the larger clusters, which already have good representations, are unlikely to change.
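
One way to check this on your own data is to fit BERTopic once per embedding model and compare the resulting topic sizes and names side by side. The sketch below assumes `bertopic` and `sentence-transformers` are installed; the helper name `topic_info_by_model` and the candidate list are illustrative, and `mixedbread-ai/mxbai-embed-large-v1` is assumed to be the Hugging Face identifier for the model mentioned above.

```python
# Candidate embedding models to compare (illustrative choice).
CANDIDATE_MODELS = [
    "all-mpnet-base-v2",
    "mixedbread-ai/mxbai-embed-large-v1",
]


def topic_info_by_model(docs, model_names=CANDIDATE_MODELS):
    """Fit one BERTopic model per embedding model and collect topic overviews.

    Returns a dict mapping model name to a DataFrame with one row per topic.
    """
    # Imported lazily so the module loads even without the heavy dependencies.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    results = {}
    for name in model_names:
        embedder = SentenceTransformer(name)
        # Precompute embeddings once so they can be reused across refits.
        embeddings = embedder.encode(docs, show_progress_bar=False)
        topic_model = BERTopic(embedding_model=embedder)
        topic_model.fit_transform(docs, embeddings)
        # Topic -1 collects the documents treated as outliers.
        results[name] = topic_model.get_topic_info()[["Topic", "Count", "Name"]]
    return results
```

Calling `topic_info_by_model(docs)` on a list of strings gives one table per model. UMAP is stochastic by default, so some variation between runs is expected even with the same embedding model; refit a few times before attributing a difference to the embeddings.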

aramis-it commented 6 months ago

@raphael-milliere did you find anything?