MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Output the distance/correlation matrix of topics #1722

Open · swl-dm opened this issue 9 months ago

swl-dm commented 9 months ago

In the visualisation heatmap, the correlation matrix of topics that gets computed is actually very useful, e.g. for debugging purposes and as a guide for topic reduction. Any chance it could become a class attribute of BERTopic or an output of calling visualize_heatmap?

MaartenGr commented 9 months ago

That is currently not possible, since it would require changing the API of the .visualize_heatmap function. However, you can distill part of the code yourself to create the correlation matrix, since it boils down to a simple cosine similarity between the topic embeddings:

https://github.com/MaartenGr/BERTopic/blob/5c9aad22f2dbb2e5ba75653ea5e56a11528393bb/bertopic/plotting/_heatmap.py#L95
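For reference, a minimal sketch of what that could look like outside the library (this reuses the public `topic_embeddings_` attribute and `get_topics()` method of a fitted model; the example data, the handling of the outlier topic, and the variable names are just illustrative assumptions, so double-check against your installed version):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

# Fit a model on some example data (replace with your own documents).
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic().fit(docs)

# Topic embeddings of the fitted model; the outlier topic (-1), if present,
# is assumed to be stored as the first row, so drop it to match the topics
# shown in the heatmap.
embeddings = np.asarray(topic_model.topic_embeddings_)
topic_ids = sorted(topic_model.get_topics().keys())
if -1 in topic_ids:
    embeddings = embeddings[1:]
    topic_ids = topic_ids[1:]

# The heatmap boils down to this: pairwise cosine similarity between topic embeddings.
similarity = cosine_similarity(embeddings)
similarity_df = pd.DataFrame(similarity, index=topic_ids, columns=topic_ids)
print(similarity_df.round(2))
```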

swl-dm commented 9 months ago

Many thanks Maarten. Having read your source code, I also realise it is just a few lines of code that are easy to implement outside the library.

On this topic, I wonder what your view is on approaches to merging topics. I am working on a medium-sized dataset (~10k examples) for which I don't have any labels, so I am blind to the true distribution of the topics, i.e. this is unsupervised learning in my case. My current approach is to first fit a model that produces more than enough topics, e.g. 30-50, then inspect the correlation matrix and heatmap to find redundant topics, e.g. those that are highly correlated with others, and finally use merge_topics() to reduce the number, roughly as sketched below. Do you think this makes any sense at all? Many thanks!
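Concretely, I am thinking of something along these lines (just a sketch reusing the `similarity_df` and `topic_model` from the snippet above; the 0.9 threshold is arbitrary, and I am assuming merge_topics takes the original documents plus the topic IDs to merge):

```python
# Collect pairs of topics whose cosine similarity exceeds an arbitrary cut-off.
threshold = 0.9
to_merge = []
seen = set()
for i, topic_a in enumerate(similarity_df.index):
    for topic_b in similarity_df.columns[i + 1:]:
        if (similarity_df.loc[topic_a, topic_b] >= threshold
                and topic_a not in seen and topic_b not in seen):
            to_merge.append([topic_a, topic_b])
            seen.update({topic_a, topic_b})

# merge_topics takes the original documents and, here, a list of lists of
# topic IDs, where each inner list is collapsed into a single topic.
if to_merge:
    topic_model.merge_topics(docs, topics_to_merge=to_merge)
```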

MaartenGr commented 9 months ago

That sounds like a reasonable approach! You could also use .reduce_topics to reduce them automatically, but whatever works best for you. Note that you can also use hierarchical topic modeling to understand which topics could potentially be merged.
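For example, something along these lines (a minimal sketch; `docs` is the list of documents the model was fitted on, and nr_topics=20 is just an illustrative value):

```python
# Automatic reduction to a fixed number of topics.
topic_model.reduce_topics(docs, nr_topics=20)

# Alternatively, inspect the topic hierarchy to decide which topics to merge.
hierarchical_topics = topic_model.hierarchical_topics(docs)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()
```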