Open kevinmneal opened 1 year ago
In order to generate predictions using `.transform`, the underlying cluster model, HDBSCAN, approximates what the predictions will be. Because it is an approximation, the result will not be exactly the same as the predictions made during fitting. The same applies to the probabilities: they are calculated after the model is fitted rather than generated by the fitting process itself, so they can differ as well. You can read more about that here. To prevent this, it might be worthwhile to use a different clustering algorithm, like k-Means, where this issue will not be present.
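To get a feel for how large the discrepancy is on your own data, it may help to compare the two sets of labels directly. The snippet below is only a rough sketch: `compare_assignments` is a hypothetical helper (not part of BERTopic), and the two label lists are made-up stand-ins for what `model.fit_transform` and `model.transform` would return for the same documents.

```python
from collections import Counter


def compare_assignments(fit_topics, transform_topics):
    """Compare per-document topic labels from fitting with labels from transform.

    Hypothetical helper, not part of BERTopic. Returns the fraction of
    documents whose label is unchanged and a Counter of the
    (fit label, transform label) pairs that disagree.
    """
    if len(fit_topics) != len(transform_topics):
        raise ValueError("both label lists must cover the same documents")
    pairs = list(zip(fit_topics, transform_topics))
    # Treat an empty input as full agreement to avoid dividing by zero.
    agreement = sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0
    mismatches = Counter((a, b) for a, b in pairs if a != b)
    return agreement, mismatches


# Made-up labels for five documents (assumption, for illustration only).
fit_topics = [1, 0, 2, 1, -1]
transform_topics = [1, 0, 1, 1, 0]

agreement, mismatches = compare_assignments(fit_topics, transform_topics)
print(f"agreement: {agreement:.2f}")
print(mismatches.most_common())
```

If most of the disagreeing pairs involve `-1` (the outlier label) or a handful of neighbouring topics, that would be consistent with the approximation described above rather than with a mistake in how the model is being called.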
Hi,
When I run BERTopic using `model.fit_transform` on my dataset, it returns cluster numbers and titles that make sense for the input data. When I run the same string through the trained model using `model.transform`, it returns a different cluster number, though it is consistent. For example, something with "Mexican restaurant" might get assigned to topic 1 "restaurant_mexican restaurant_full service restaurant", but when I run the exact same record through `model.transform`, I get a different integer for the topic (corresponding to something totally different) and even a different probability, and these can even differ between runs of `model.transform` on the same trained model. Am I doing something wrong? This has been befuddling me. Note that the topic numbers from `model.get_topic_info` and those returned from `model.fit_transform` do correspond with one another.

Perhaps related, but I saw a similarly frustrating difference between the output of `get_topic_info` and the output when using `calculate_probabilities=True`, where the columns of the probability array did not correspond to the topic numbers from `get_topic_info`.