MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Estimating predicted topic probabilities using partial_fit #815

Closed mlinegar closed 1 year ago

mlinegar commented 2 years ago

I am trying to use partial_fit to keep memory usage low. Is it possible to recover the (final) predicted topic probabilities? For example, by following the example in the documentation and then applying transform to each document?

topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

topic_model.topics_ = topics

for docs in doc_chunks:
    final_topics, final_probs = topic_model.transform(docs)

MaartenGr commented 2 years ago

In order to estimate predicted topic probabilities in BERTopic, we need a clustering algorithm that supports that feature. There are not many out there, and HDBSCAN is one of the few that can calculate probabilities. However, when you use .partial_fit, we cannot use HDBSCAN as it does not support online learning. We typically resort to an algorithm that does, like Mini-Batch K-Means, but that algorithm does not generate probabilities. In other words, to calculate probabilities with .partial_fit, you would need a clustering algorithm that supports both online learning and probability estimates, and I am not familiar with any such model.
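One possible workaround, sketched below, is not part of BERTopic's API and is only an assumption: Mini-Batch K-Means supports online learning via partial_fit, and although it does not emit probabilities, its transform() returns distances to each cluster centroid, which can be turned into pseudo-probabilities with a softmax over negative distances.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for document embeddings (hypothetical data, 100 docs x 5 dims).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# Fit incrementally, chunk by chunk, as in an online-learning setup.
km = MiniBatchKMeans(n_clusters=3, random_state=42)
for chunk in np.array_split(X, 4):
    km.partial_fit(chunk)

# transform() returns distances to each centroid, shape (n_docs, n_clusters).
distances = km.transform(X)

# Softmax over negative distances: closer centroids get higher scores,
# and each row sums to 1. These are NOT calibrated probabilities, just a
# monotone proxy derived from the distances.
pseudo_probs = np.exp(-distances)
pseudo_probs /= pseudo_probs.sum(axis=1, keepdims=True)
```

Because the softmax is monotone, the highest pseudo-probability always coincides with the nearest centroid, so hard assignments are unchanged; only a relative confidence signal is added.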

mlinegar commented 1 year ago

Makes sense! I'll do some looking to be sure, but I can work without probabilities for now. Thanks very much for the quick response!