MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

should we reduce the dimensionality of topic_model.topic_embeddings_? #1959

Open Batchounet opened 5 months ago

Batchounet commented 5 months ago

Dear creator of the amazing BERTopic,

I want to compute the cosine similarity between the topic embeddings and a list of labels. I found this to perform better than zero-shot (and faster!) for my use case. However, the embeddings in `topic_model.topic_embeddings_` are 384-dimensional vectors, i.e., their dimensionality is not reduced by HDBSCAN. To my understanding, the cosine similarity could suffer from the curse of dimensionality because of that. In fact, plotting the maximum cosine similarity to my list of labels might suggest as much: most topics end up with a cosine similarity of around 0.55 to my labels. [cosine score plot]
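For reference, a minimal sketch of what I am doing; the label list below is a placeholder, and I am assuming the same 384-dimensional sentence-transformers model that the topic model was fitted with:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# `topic_model` is an already fitted BERTopic instance; the labels below
# are placeholders for my actual label list.
labels = ["finance", "sports", "politics"]

# Embed the candidate labels with the same model used for the documents.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
label_embeddings = embedding_model.encode(labels)

# (n_topics, 384) x (n_labels, 384) -> (n_topics, n_labels)
similarities = cosine_similarity(topic_model.topic_embeddings_, label_embeddings)

best_label_per_topic = [labels[i] for i in similarities.argmax(axis=1)]
max_score_per_topic = similarities.max(axis=1)  # this is what I plotted
```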

Should I add a dimensionality reduction step? Would it be possible to use the reduced embeddings directly for the topic model?

Again, thanks for your work!

MaartenGr commented 5 months ago

> To my understanding, the cosine similarity could suffer from the curse of dimensionality because of that.

Actually, that's not entirely the case. Sure, the curse of dimensionality has some influence, but it is generally much smaller for cosine similarity than for other distance measures, like Euclidean distance. There's a reason why you see cosine similarity (alongside the dot product) used in embedding-based computations, and that's because these distance measures work so well in practice.
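As a quick illustration of how these measures relate (a minimal sketch, using random vectors): for L2-normalized vectors, the dot product equals the cosine similarity, and the squared Euclidean distance reduces to `2 - 2 * cosine`, so on normalized embeddings they rank neighbors the same way.

```python
import numpy as np

# Two random 384-dimensional vectors, L2-normalized like many
# sentence-transformers embeddings.
rng = np.random.default_rng(42)
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos = a @ b                        # cosine similarity == dot product here
sq_euclidean = np.sum((a - b) ** 2)
assert np.isclose(sq_euclidean, 2 - 2 * cos)
```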

For the "highest precision", I would advise not reducing the dimensionality of the embeddings when using plain cosine similarity.

> However, the embeddings in `topic_model.topic_embeddings_` are 384-dimensional vectors, i.e., their dimensionality is not reduced by HDBSCAN.

Note that it's UMAP reducing the embeddings, not HDBSCAN.
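If you still want to experiment with the reduced embeddings, a hedged sketch of where they live, assuming the default umap-learn backend (which exposes its fitted coordinates as `.embedding_`); note these are per-document coordinates, not per-topic:

```python
import numpy as np

# Assumes a fitted BERTopic model using the default umap-learn backend.
# umap-learn stores the coordinates of the data it was fitted on in
# `.embedding_` (shape: n_documents x n_components); relying on this is
# an assumption about BERTopic's internals, not a public API.
reduced_doc_embeddings = topic_model.umap_model.embedding_
print(np.asarray(reduced_doc_embeddings).shape)

# New vectors (e.g. label embeddings) could be projected into the same
# reduced space before comparing them with cosine similarity:
# reduced_labels = topic_model.umap_model.transform(label_embeddings)
```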

Batchounet commented 5 months ago

Thank you very much. Yes, I meant UMAP.