In order to estimate the predicted topic probabilities in BERTopic, we need a clustering algorithm that supports that feature. There are not many out there, and HDBSCAN is one of the few that can calculate probabilities. However, when you use .partial_fit, HDBSCAN cannot be used because it does not support online learning. We typically resort to an algorithm that does, such as Mini-Batch K-Means, but that algorithm does not generate probabilities. In other words, to calculate probabilities with .partial_fit, you would need a clustering algorithm that both supports online learning and generates probabilities, and I am not familiar with any such model.
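For reference, a minimal sketch of the online setup being discussed, loosely following the online topic modeling example in the BERTopic documentation. The chunking variable doc_chunks and the specific parameter values are illustrative assumptions, not part of this thread:

```python
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

# Online-capable components: IncrementalPCA stands in for UMAP and
# MiniBatchKMeans stands in for HDBSCAN so that .partial_fit works.
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english")

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,  # any clustering model can be passed here
    vectorizer_model=vectorizer_model,
)

# doc_chunks is a placeholder for an iterable of document batches.
for docs in doc_chunks:
    topic_model.partial_fit(docs)

# MiniBatchKMeans has no notion of soft assignments, so no topic
# probabilities are produced during online learning.
```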
Makes sense! I'll do some looking to be sure, but I can work without probabilities for now. Thanks very much for the quick response!
I am trying to use partial_fit for low memory. Is it possible to recover the (final) predicted topic probabilities? For example, by following the example in the documentation and then applying transform to each document?
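For what it's worth, the transform-based idea in the question would look roughly like the sketch below, continuing from the hypothetical topic_model above. As noted in the reply, with Mini-Batch K-Means the second return value will not contain meaningful per-topic probabilities:

```python
# Hypothetical follow-up: after all partial_fit calls, assign topics
# to each batch of documents with .transform.
all_topics = []
for docs in doc_chunks:
    topics, probs = topic_model.transform(docs)
    all_topics.extend(topics)
    # With MiniBatchKMeans as the cluster model, `probs` is not a
    # per-topic probability distribution; only models that expose
    # probabilities (e.g., HDBSCAN) can fill it in.
```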