MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.12k stars 763 forks

Heatmap visualization is shifted because of outliers topic #782

Closed aholovenko closed 1 year ago

aholovenko commented 2 years ago

I was generating the heatmap with the self.model.visualize_heatmap() method and noticed that the visualization doesn't match the distance values. I think the issue is that the -1 topic is included when self.topic_embeddings_ is computed, but it is removed inside the visualize_heatmap() function, so the embeddings end up shifted by one row relative to the topics shown.

Could you check, please? https://github.com/MaartenGr/BERTopic/blob/09c1732997f838050c263ad00ad3c9474e816863/bertopic/plotting/_heatmap.py#L93 I guess this provides correct results:

    embeddings = embeddings[1:][indices]

MaartenGr commented 2 years ago

Thank you for sharing this. I will have to look into this a bit more since the shift may or may not happen depending on whether HDBSCAN is used or another clustering algorithm that does not have -1 in its possible classes.
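
To make that dependency concrete, here is a minimal NumPy sketch on toy data. It is not BERTopic's actual code: `select_topic_embeddings`, `topic_ids`, and `shown` are hypothetical names used only for illustration, and the sketch assumes that row i of the embedding matrix belongs to the i-th topic label.

```python
import numpy as np


def select_topic_embeddings(embeddings, topic_ids, selected):
    """Pick embedding rows by topic label instead of by raw position.

    Hypothetical helper for illustration only. Assumes `embeddings` has one
    row per entry of `topic_ids`, in the same order.
    """
    row_of = {topic: row for row, topic in enumerate(topic_ids)}
    rows = [row_of[topic] for topic in selected]
    return np.asarray(embeddings)[rows]


# Toy case with an outlier topic: row 0 belongs to topic -1.
topic_ids = [-1, 0, 1, 2]
embeddings = np.array([[9.0, 9.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])

shown = [0, 1, 2]          # topics kept for the plot (outliers removed)
indices = np.array(shown)  # positions computed after dropping -1

shifted = embeddings[indices]     # rows of topics -1, 0, 1: off by one
sliced = embeddings[1:][indices]  # the suggested fix: correct when -1 is row 0
by_label = select_topic_embeddings(embeddings, topic_ids, shown)

# Toy case without an outlier topic (a clusterer that never emits -1).
# Here `embeddings[1:][indices]` would index past the end of the array,
# while the label-based lookup still returns the right rows.
topic_ids_no_outliers = [0, 1, 2]
embeddings_no_outliers = embeddings[1:]
by_label_no_outliers = select_topic_embeddings(
    embeddings_no_outliers, topic_ids_no_outliers, shown
)
```

Under these assumptions, the `[1:]` slice and the label-based lookup agree whenever -1 is present as the first row, but only the label-based lookup also works when there is no outlier topic.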

aholovenko commented 2 years ago

Thanks @MaartenGr