MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

Probabilities from fit_transform and transform are different #1831

Open Bougeant opened 7 months ago

Bougeant commented 7 months ago

Using bertopic==0.16.0 on a macOS M1 machine, I have found some very strange behavior in the per-topic probabilities.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
import numpy as np

# Load 5,000 ML ArXiv abstracts
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
docs = dataset["abstract"][:5000]

bertopic = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
    calculate_probabilities=True,
)

# Pre-compute embeddings once and reuse them for both calls
embeddings = bertopic.embedding_model.encode(docs, show_progress_bar=True)
y_pred, y_prob = bertopic.fit_transform(docs, embeddings)
y_pred_transform, y_prob_transform = bertopic.transform(docs, embeddings)

First of all, probabilities don't add up to 100% (addressed in #500), apparently because they do not account for the probability of not belonging to any topic, which I guess is fine.

y_prob.sum(axis=1) --> [0.458, 0.386, 0.607, ..., 1., 0.927, 0.350]
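
A minimal way to read this, assuming the missing mass corresponds to the implicit outlier (-1) topic:

# Illustrative only: interpret the leftover mass as the probability of
# not being assigned to any topic (the -1 "outlier" topic).
no_topic_mass = 1.0 - y_prob.sum(axis=1)
print(no_topic_mass[:3])  # e.g. ~[0.542, 0.614, 0.393] given the sums above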

However, I have noticed that fit_transform and transform don't return the same probabilities, which is obviously a concern: which one should we trust? Oddly, the row sums of the probabilities are identical for both methods. The discrepancy comes from the fact that fit_transform and transform are defined independently and do different things (typically, with the sklearn API, one defines fit and transform and gets fit_transform for free, which guarantees consistency between fit_transform and transform predictions).

np.allclose(y_pred, y_pred_transform) --> True
np.allclose(y_prob, y_prob_transform) --> False
np.allclose(y_prob.sum(axis=1), y_prob_transform.sum(axis=1)) --> True

What's even more surprising is that the predicted topic is not the topic with the highest probability. Another issue (#1024) reported this as well and it appeared to have been fixed in v0.14.1. In my case, the argmax of the probabilities matches the predicted topic for only 71% of the documents:

np.allclose(y_pred, np.argmax(y_prob, axis=1)) --> False
np.allclose(y_pred_transform, np.argmax(y_prob_transform, axis=1)) --> False
y_pred --> [70, 52,  6, ...,  1, -1, -1]
np.argmax(y_prob, axis=1) --> [65, 52,  6, ...,  1, 75, 12]
(np.argmax(y_prob, axis=1) == np.array(y_pred)).mean() --> 0.7128
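
Part of the mismatch is expected, because y_pred contains -1 for outlier documents and np.argmax over the topic probabilities can never return -1. A minimal follow-up sketch (the outlier masking here is illustrative, not part of the run above):

# Check agreement only for non-outlier documents (topic != -1),
# since argmax over topic probabilities can never produce -1.
y_pred_arr = np.array(y_pred)
mask = y_pred_arr != -1
print((np.argmax(y_prob[mask], axis=1) == y_pred_arr[mask]).mean())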

So, given all these inconsistencies, should we trust probabilities at all?

MaartenGr commented 7 months ago

First of all, probabilities don't add up to 100% (addressed in https://github.com/MaartenGr/BERTopic/issues/500), apparently because they do not account for the probability of not belonging to any topic, which I guess is fine.

That's correct; it is the underlying procedure of HDBSCAN, which I try to stay true to as much as possible, given that BERTopic is a modular framework.

However, I have noticed that fit_transform and transform don't return the same probabilities, which is obviously a concern: which one should we trust? Oddly, the row sums of the probabilities are identical for both methods. The discrepancy comes from the fact that fit_transform and transform are defined independently and do different things (typically, with the sklearn API, one defines fit and transform and gets fit_transform for free, which guarantees consistency between fit_transform and transform predictions).

(Un)fortunately, it is not that straightforward to simply adopt the sklearn API and get identical definitions. There are several reasons for this, all related to the underlying pipeline/algorithm. Clustering models often do not have a transform/predict function, so something that handles prediction independently is often necessary. Take HDBSCAN, for example: it uses a different method for fitting than for predicting new data points. These can give quite different results, since HDBSCAN calculates the probabilities after the cluster assignment has been made; the probabilities are therefore an approximation. If you were to use a different algorithm, like k-Means, the results would remain the same. This is inherent to a modular framework, as you cannot control for all possible algorithms.

Moreover, BERTopic has an option for efficient saving (and hosting on Hugging Face) that removes the need for the underlying dimensionality reduction and clustering models at inference time; you can perform inference without them. The results would then be different, but this comes with several advantages.
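
To make the fit-versus-transform difference concrete, here is a minimal sketch using the hdbscan library directly (BERTopic's internal calls may differ in detail, so treat this as an illustration rather than a description of its implementation): soft probabilities for the training data come from one routine, while (re-)predicted points go through another, and the two approximations need not agree point by point.

import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

# Illustrative only: two different soft-clustering approximations in hdbscan.
X, _ = make_blobs(n_samples=500, centers=5, n_features=5, random_state=42)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Soft probabilities for the points the model was fitted on ...
probs_fit = hdbscan.all_points_membership_vectors(clusterer)

# ... versus probabilities from the prediction routine for the same points.
probs_predict = hdbscan.membership_vector(clusterer, X)

print(np.allclose(probs_fit, probs_predict))  # typically False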

What's even more surprising is that the predicted topic is not the topic with the highest probability. Another issue (https://github.com/MaartenGr/BERTopic/issues/1024) reported this as well and it appeared to have been fixed in v0.14.1. In my case, the argmax of the probabilities matches the predicted topic for only 71% of the documents:

This is not surprising either, since it again relates to the underlying clustering algorithm, HDBSCAN: the probability calculation is an approximation and not an inherent part of the clustering itself. As mentioned, you would have to use a different algorithm if you want the results to be consistent.
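
For example, a minimal sketch of swapping in k-Means as the cluster model (passed through the hdbscan_model parameter; how probabilities are computed for non-HDBSCAN models may still differ, so this only illustrates the swap):

from sklearn.cluster import KMeans
from bertopic import BERTopic

# k-Means has a deterministic predict step, so fit_transform and transform
# assign documents to the same clusters; note it also has no outlier (-1) topic.
cluster_model = KMeans(n_clusters=50, random_state=42)
topic_model = BERTopic(hdbscan_model=cluster_model)

topics, probs = topic_model.fit_transform(docs, embeddings)
topics_tf, probs_tf = topic_model.transform(docs, embeddings)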

So, given all these inconsistencies, should we trust probabilities at all?

That is not for me to say, since it depends heavily on your use case. Did you validate the probabilities against the assignments? Do you want fast inference, or do you want to make use of the underlying fitting process? Do you want outliers to be modeled or not?
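
If outliers are what bothers you, BERTopic also has reduce_outliers to reassign the -1 documents after fitting; a minimal sketch, assuming topic_model and topics come from a fitted run (the c-TF-IDF strategy shown is just one of the documented options):

# Reassign outlier (-1) documents to their closest topics after fitting.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
topic_model.update_topics(docs, topics=new_topics)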

One thing I will mention: I do not believe it is as black-and-white as you make it out to be, for the reasons above and because it depends on the choice of underlying algorithms.

Bougeant commented 7 months ago

Thanks @MaartenGr for your detailed answer.

I will take a look at other clustering algorithms to see if these issues persist.