MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

How to improve Hierarchical Clustering #1877

Open stageadss opened 5 months ago

stageadss commented 5 months ago

Hi,

After training my model, I wanted to group similar topics using visualize_hierarchy, but the result is not great: the topics within each cluster are not really related. Is it because the c_tf_idf matrix is so high-dimensional? Or is it because I used transformers to fine-tune my topic representations, so there is a large discrepancy between the topic keywords and the topic representations, which leads to poor hierarchical clustering?

MaartenGr commented 5 months ago

It is difficult to say without knowing a bit more. Can you share your full code? Also, can you show an example of the hierarchy that was created? Lastly, which version of BERTopic are you using?

stageadss commented 5 months ago

I can't really share the code or the topics, as they are company-private. But I thought about taking the topic representations, feeding them back into BERTopic, and using fine-tuning. What do you think? Would that make the clustering more robust? (I use version 0.16.0.)

MaartenGr commented 5 months ago

It is really difficult to say without knowing specifically how you created the model; there might be something going on there. Having said that, I believe the hierarchical modeling is done using the c-TF-IDF representations, but you could use the topic embeddings instead, which might help in this case.
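
For readers who want to try the topic-embedding route, here is a minimal sketch of the idea, not BERTopic's own implementation. It clusters one embedding vector per topic with SciPy's average-linkage clustering on cosine distances. The helper name `cluster_topic_embeddings` and the toy array are hypothetical. On a fitted model, the input would presumably be `topic_model.topic_embeddings_`, with the outlier topic's row (topic -1, if present) dropped first; that alignment is an assumption to verify on your own model.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_topic_embeddings(topic_embeddings, n_groups):
    """Hypothetical helper: group topics by average-linkage clustering
    on the cosine distances between their embedding vectors."""
    embeddings = np.asarray(topic_embeddings, dtype=float)
    # Condensed pairwise cosine-distance vector (one entry per topic pair)
    distances = pdist(embeddings, metric="cosine")
    # Build the merge tree (dendrogram) from those distances
    tree = linkage(distances, method="average")
    # Cut the tree so that at most n_groups flat clusters remain
    return fcluster(tree, t=n_groups, criterion="maxclust")


# Toy stand-in for real topic embeddings (assumption: one row per topic)
toy_embeddings = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])
groups = cluster_topic_embeddings(toy_embeddings, n_groups=2)
print(groups)  # one cluster label per topic row
```

Comparing these groups with the c-TF-IDF-based hierarchy is a quick way to check whether the high-dimensional c-TF-IDF matrix is what makes the clusters look unrelated. Depending on the BERTopic version, the hierarchy functions may also offer a built-in option for this, so it is worth checking the documentation for the installed version.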