MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Elevating methods related to topic word generation to public API #1161

Closed steven-solomon closed 1 year ago

steven-solomon commented 1 year ago

Problem

My team is using BERTopic to detect topics within a dataset. We have found it useful to create a higher-level grouping of related topics into clusters. Having fewer groups has reduced the cognitive load needed for folks to start exploring a dataset. Along the same lines, we would also like to generate top words for this new set of clustered topics.

Request

Currently, we are using two private methods to generate words for our clusters of topics: _extract_words_per_topic and _c_tf_idf. However, because these methods are private and not formally part of the API, relying on them poses a long-term development risk for us. Would you please consider supporting the contract of these methods as part of the public API?
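For context, the kind of ranking we need can be sketched without BERTopic's internals. Below is a simplified, illustrative reimplementation of the c-TF-IDF idea (treat each cluster as one concatenated "class document", then weight each term's class frequency by an inverse overall frequency); the function and variable names are ours, and this is not BERTopic's actual `_c_tf_idf` code:

```python
from collections import Counter
import math

def top_words_per_cluster(docs_per_cluster, top_n=3):
    """Rank words per cluster with a simplified c-TF-IDF.

    docs_per_cluster: dict mapping a cluster label to its list of documents.
    Illustrative sketch only, not BERTopic's internal implementation.
    """
    # Concatenate each cluster's documents into one "class document"
    # and count term frequencies per class.
    class_tf = {c: Counter(w for doc in docs for w in doc.lower().split())
                for c, docs in docs_per_cluster.items()}
    # Overall term frequencies across all classes.
    total_tf = Counter()
    for tf in class_tf.values():
        total_tf.update(tf)
    # A = average number of words per class, as in the c-TF-IDF formula.
    avg_words_per_class = sum(total_tf.values()) / len(class_tf)
    top_words = {}
    for c, tf in class_tf.items():
        scores = {t: f * math.log(1 + avg_words_per_class / total_tf[t])
                  for t, f in tf.items()}
        top_words[c] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return top_words

clusters = {
    "pets": ["cats purr", "dogs bark", "cats and dogs"],
    "space": ["stars shine", "planets orbit stars"],
}
# "cats" and "dogs" rank highest for the pets cluster,
# "stars" for the space cluster.
print(top_words_per_cluster(clusters))
```

Having something like this supported officially would let us drop our dependency on the private methods.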

MaartenGr commented 1 year ago

Thanks for the extensive description! Just to be sure I understand correctly: if you have found some topics that can be clustered together, could you not use the .merge_topics function for that? Internally, it merges topics and creates a higher-level grouping of clusters. Alternatively, you could reduce the number of topics with .reduce_topics to create those higher-level groupings.
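For illustration, .merge_topics takes the original documents plus the topic ids to merge (e.g. `topic_model.merge_topics(docs, [[1, 2], [3, 4]])`). Conceptually, merging boils down to relabeling assignments and then recomputing the topic representation; here is a toy sketch of just the relabeling step (a hypothetical helper, not BERTopic code):

```python
def merge_topic_labels(topic_assignments, topics_to_merge):
    """Relabel a group of fine-grained topic ids as one merged topic.

    Hypothetical helper for illustration only: the real .merge_topics
    also recomputes the merged topic's c-TF-IDF representation.
    """
    merged = set(topics_to_merge)
    target = min(merged)  # keep the smallest id, as a convention
    return [target if t in merged else t for t in topic_assignments]

# Documents assigned to topics 1 and 2 end up in a single topic 1.
print(merge_topic_labels([0, 1, 2, 1, 3], [1, 2]))  # [0, 1, 1, 1, 3]
```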

steven-solomon commented 1 year ago

@MaartenGr, thanks for the suggestion. I'll give .merge_topics a try in our pipeline and report back. We are currently doing one round of .reduce_topics as well. Essentially, we are creating a very high-level grouping so folks can reason about the large themes, and then allowing them to dig into successive levels of detail.

MaartenGr commented 1 year ago

No problem! In that case, it might also be worthwhile to check out hierarchical topic modeling with BERTopic, as it creates various levels of groupings. It also works rather well with the KeyBERTInspired representation model, which tends to give a bit more human-readable labels/keywords.
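Roughly speaking, the hierarchy is built by agglomeratively merging the most similar topic representations until everything is joined. Here is a toy sketch of that greedy merging over topic vectors; the names are made up and the actual implementation differs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def merge_order(topic_vectors):
    """Greedily merge the two most similar topic vectors until one remains.

    topic_vectors: dict mapping a topic id to its vector representation.
    Returns a list of (topic_a, topic_b, new_topic_id) merge steps, i.e.
    the successive levels of grouping. Toy sketch for illustration only.
    """
    active = {t: list(v) for t, v in topic_vectors.items()}
    merges = []
    next_id = max(active) + 1
    while len(active) > 1:
        ids = sorted(active)
        # Pick the most similar pair of still-active topics.
        _, a, b = max((cosine(active[a], active[b]), a, b)
                      for i, a in enumerate(ids) for b in ids[i + 1:])
        # Represent the merged topic by the element-wise mean.
        merged = [(x + y) / 2 for x, y in zip(active.pop(a), active.pop(b))]
        active[next_id] = merged
        merges.append((a, b, next_id))
        next_id += 1
    return merges

# Topics 0 and 1 point in nearly the same direction, so they merge first.
print(merge_order({0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}))
```

Each entry in the returned merge list corresponds to one level of the grouping, which is the structure you could let users drill down through.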