Open swl-dm opened 9 months ago
That is currently not possible since that would result in changing the API of the .visualize_heatmap function. However, you can distill part of the code yourself to create the correlation matrix since it boils down to a simple cosine similarity between embeddings:
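A minimal sketch of that computation, assuming the topic embeddings are available as a 2-D NumPy array with one row per topic (in BERTopic they are typically exposed as `topic_model.topic_embeddings_`). The helper name `topic_similarity_matrix` and the toy data are hypothetical and only illustrate the idea; this is not the library's own implementation.

```python
import numpy as np


def topic_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=float)
    # Normalise every row to unit length; guard against all-zero rows.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = embeddings / norms
    # The dot product of unit vectors is their cosine similarity.
    return unit @ unit.T


# Toy stand-in for real topic embeddings (assumption: one row per topic).
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
similarity = topic_similarity_matrix(toy_embeddings)
```

With a fitted model, the same function could be called on the model's topic embeddings instead of the toy array.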
Many thanks, Maarten. Having read your source code, I also realise it is just a few lines of code that are easy to implement outside the library.
On this topic, I wonder what your view is on approaches to merging topics. I am working on a medium-sized (~10k examples) dataset for which I don't have any labels. So I am blind to the true distribution of the topics, i.e. this is unsupervised learning in my case. My current approach is to first fit a model that produces more than enough topics, e.g. 30-50. Then I inspect the correlation matrix and heatmap to find redundant topics, e.g. those that are highly correlated with others, and use merge_topics() to reduce the number. Do you think this makes any sense at all? Many thanks!
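The inspection step described above could be partly automated. The sketch below groups topics whose pairwise similarity exceeds a cutoff, treating the similarity matrix as a graph and collecting its connected components. The helper name `groups_to_merge`, the 0.9 threshold, and the toy matrix are all hypothetical choices for illustration.

```python
import numpy as np


def groups_to_merge(similarity, threshold=0.9):
    """Group row indices that are linked by a similarity >= threshold."""
    similarity = np.asarray(similarity)
    n = similarity.shape[0]
    parent = list(range(n))  # union-find forest over row indices

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of distinct rows whose similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i, j] >= threshold:
                parent[find(j)] = find(i)

    # Collect the members of each component and keep only real groups.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


# Toy similarity matrix (assumption: symmetric, ones on the diagonal).
toy_similarity = np.array([
    [1.0, 0.95, 0.1, 0.2],
    [0.95, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.3],
    [0.2, 0.1, 0.3, 1.0],
])
merge_candidates = groups_to_merge(toy_similarity, threshold=0.9)
```

The resulting groups are row indices of the matrix, so they would need to be mapped back to the model's topic IDs (an outlier topic, if present, may occupy a row) before being passed as the list of topics to `merge_topics()`. Reviewing the candidates by hand first is probably wise, since the threshold is arbitrary.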
That sounds like a reasonable approach! You could also use .reduce_topics to reduce them automatically but whatever works best for you. Note that you can also use hierarchical topic modeling to understand which topics can potentially be merged.
In the heatmap visualisation, the calculation of the topic correlation matrix is actually very useful, e.g. for debugging purposes and as a guide for topic reduction. Any chance it could become an attribute of the BERTopic class or an output of calling visualize_heatmap?