MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.12k stars 763 forks

Hierarchical Visualization of the topics using HDBSCAN #658

Closed e-barrere closed 2 years ago

e-barrere commented 2 years ago

Hello,

Thank you for this fantastic work, BERTopic is really useful. I was wondering why the visualization of the hierarchy is based on the results of the c_tf_idf? Since the HDBSCAN output is already hierarchical, why recalculate a distance representation from the c_tf_idf rather than use the HDBSCAN result directly?

Thank you

MaartenGr commented 2 years ago

The main reason for this is modularity. Although HDBSCAN is the default model, other clustering algorithms, such as k-Means, can be used instead. To support any clustering technique, this step needs to be somewhat independent of the clustering model. There is also something to be said for comparing the end result, namely the topic representations, and to a lesser extent the clusters. That, however, might just be semantics, although it does follow the philosophy of modularity as presented in the package.
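
To make the modularity point concrete, here is a minimal sketch of a topic hierarchy built only from topic-term vectors, so it works the same whichever clustering model produced the topics. This is not BERTopic's actual implementation. The `topic_term_weights` values are hypothetical toy numbers standing in for c-TF-IDF rows, the helper names are made up for illustration, and average linkage over cosine distances is chosen here only for simplicity.

```python
import math
from itertools import combinations

# Hypothetical toy topic-term weights standing in for c-TF-IDF rows.
# The topics could come from HDBSCAN, k-Means, or any other clustering model;
# the hierarchy below never looks at the clustering model itself.
topic_term_weights = {
    0: [0.9, 0.8, 0.0, 0.0],
    1: [0.8, 0.9, 0.1, 0.0],
    2: [0.0, 0.2, 0.9, 0.7],
    3: [0.1, 0.0, 0.7, 0.9],
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(cluster_a, cluster_b, weights):
    """Mean pairwise cosine distance between the topics of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    total = sum(cosine_distance(weights[a], weights[b]) for a, b in pairs)
    return total / len(pairs)


def build_hierarchy(weights):
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [(topic,) for topic in sorted(weights)]
    merges = []
    while len(clusters) > 1:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], weights),
        )
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(left + right)
        merges.append((left, right))
    return merges


merges = build_hierarchy(topic_term_weights)
for left, right in merges:
    print(f"merge {left} + {right}")
```

Because the only input is the topic-term matrix, swapping the clustering model changes the topics but not the code that builds the hierarchy.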

e-barrere commented 2 years ago

I get it now, thank you for your answer!