MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.12k stars 763 forks

Topics returned from model.transform differ from model.fit_transform #905

Open kevinmneal opened 1 year ago

kevinmneal commented 1 year ago

Hi,

When I run BERTopic using model.fit_transform on my dataset, it returns cluster numbers and titles that make sense for the input data. When I run the same string through the trained model using model.transform, it returns a different cluster number. For example, something with "Mexican restaurant" might get assigned to topic 1 "restaurant_mexican restaurant_full service restaurant", but when I run the exact same record through model.transform, I get a different integer for the topic (corresponding to something totally different) and even a different probability. These can even differ between runs of model.transform on the same trained model. Am I doing something wrong? This has been befuddling me. Note that the topic numbers from model.get_topic_info and those returned from model.fit_transform do correspond with one another.

Perhaps related: I saw a similarly frustrating mismatch between the output of get_topic_info and the probabilities returned when using calculate_probabilities=True, where the columns of the probability array did not correspond to the topic numbers from get_topic_info.

MaartenGr commented 1 year ago

In order to generate predictions using .transform, the underlying cluster model, HDBSCAN, approximates what the predictions will be. Because it is an approximation, the result will not be exactly the same as the predictions made during the fitting process. The same applies to the probabilities: these are calculated after fitting the model rather than generated as a result of the fitting process, so they can also differ. You can read more about that here. To prevent these things, it might be worthwhile to use a different clustering algorithm, like k-Means, where this issue will not be present.
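To illustrate why a centroid-based model like k-Means avoids the mismatch, here is a toy, pure-Python sketch. It is not BERTopic's or scikit-learn's implementation; the helper names (nearest, fit_kmeans, predict) and the 2-D points standing in for document embeddings are made up for illustration. The point it shows is that k-Means labels a document by its nearest centroid both during fitting and during prediction, so a training document gets the same label either way.

```python
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def fit_kmeans(points, k, n_iter=100):
    """Toy Lloyd's algorithm with a deterministic start (the first k points).

    Hypothetical helper for illustration only; returns (centroids, labels),
    where labels are always the nearest-centroid labels of the final centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [nearest(p, centroids) for p in points]
    for _ in range(n_iter):
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Assignment step: relabel every point by its nearest centroid.
        new_labels = [nearest(p, new_centroids) for p in points]
        centroids = new_centroids
        if new_labels == labels:
            break  # converged: labels and centroids agree
        labels = new_labels
    return centroids, labels


def predict(points, centroids):
    """Label new (or old) points with the same nearest-centroid rule used in fitting."""
    return [nearest(p, centroids) for p in points]


# Made-up 2-D "embeddings": two well-separated groups of three points each.
embeddings = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]

centroids, fit_labels = fit_kmeans(embeddings, k=2)
predict_labels = predict(embeddings, centroids)

# The labels from fitting and from predicting on the same data are identical.
print(fit_labels == predict_labels)
```

In BERTopic itself, a scikit-learn KMeans instance can be passed through the hdbscan_model argument, e.g. BERTopic(hdbscan_model=KMeans(n_clusters=50)), where the value 50 is only a placeholder. Note that k-Means assigns every document to a cluster, so there is no outlier topic -1.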