MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Predicted probabilities inconsistency and questions about saving and loading model #1482

Open lysummer55 opened 1 year ago

lysummer55 commented 1 year ago

Hi I have two questions when I use a HDBSCAN model with BERTopic.

  1. When I try to predict a new sentence (I want the outlier probability as well, so I am using HDBSCAN for prediction instead of the approximate_distribution function), I call topic_model.transform(text_input[0]) on my first training text, but it gives me a probability distribution different from probs[0] returned by topics, probs = topic_model.fit_transform(text_input) when I fit the model with calculate_probabilities=True. Shouldn't they be the same, since they use the same trained HDBSCAN model for prediction? Where does this inconsistency come from?

  2. I have my own sentence-transformer model (say, a fine-tuned SBERT, or a BERT CLS-token model wrapped in the SentenceTransformer class). If I want to keep my HDBSCAN and embedding model for inference when I reload my saved model, pickle is the only format I can save to, right? As I understand it, the pytorch and safetensors saving options only support Hugging Face-hosted sentence transformers (by re-downloading them) rather than a customized or locally saved sentence transformer, and they do not save the HDBSCAN model. Is that correct? Is there any lighter option (smaller saved size) for my purpose?

Thank you!

MaartenGr commented 1 year ago

> When I try to predict a new sentence (I want the outlier probability as well, so I am using HDBSCAN for prediction instead of the approximate_distribution function), I call topic_model.transform(text_input[0]) on my first training text, but it gives me a probability distribution different from probs[0] returned by topics, probs = topic_model.fit_transform(text_input) when I fit the model with calculate_probabilities=True. Shouldn't they be the same, since they use the same trained HDBSCAN model for prediction? Where does this inconsistency come from?

This has to do with how HDBSCAN approximates the probabilities. The key word here is approximation: extracting probabilities is a separate process from the actual assignment of clusters. You can find more about that here.

> I have my own sentence-transformer model (say, a fine-tuned SBERT, or a BERT CLS-token model wrapped in the SentenceTransformer class). If I want to keep my HDBSCAN and embedding model for inference when I reload my saved model, pickle is the only format I can save to, right?

No, you can save your model as safetensors or pytorch and then, when loading the model, simply pass your custom embedding model through the embedding_model parameter. You can find more about that here.
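As a minimal sketch (the local model path and the documents below are placeholder assumptions, not from this thread):

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Hypothetical path to your locally saved, fine-tuned sentence transformer.
embedding_model = SentenceTransformer("path/to/my-finetuned-sbert")

docs = [...]  # assumption: your list of training documents

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)

# Save as safetensors; the embedding model itself is not serialized.
topic_model.save(
    "my_topic_model",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=False,
)

# When loading, pass the custom embedding model back in explicitly.
loaded_model = BERTopic.load("my_topic_model", embedding_model=embedding_model)
```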

> As I understand it, the pytorch and safetensors saving options only support Hugging Face-hosted sentence transformers (by re-downloading them) rather than a customized or locally saved sentence transformer, and they do not save the HDBSCAN model. Is that correct?

These methods support storing a pointer to a Hugging Face-hosted model that can be loaded with sentence-transformers; that generally covers many of the models on the MTEB leaderboard. They will indeed only save the pointer, and not the HDBSCAN or UMAP models.

> Is there any lighter option (smaller saved size) for my purpose?

See above.

lysummer55 commented 1 year ago

> When I try to predict a new sentence (I want the outlier probability as well, so I am using HDBSCAN for prediction instead of the approximate_distribution function), I call topic_model.transform(text_input[0]) on my first training text, but it gives me a probability distribution different from probs[0] returned by topics, probs = topic_model.fit_transform(text_input) when I fit the model with calculate_probabilities=True. Shouldn't they be the same, since they use the same trained HDBSCAN model for prediction? Where does this inconsistency come from?
>
> This has to do with how HDBSCAN approximates the probabilities. The key word here is approximation: extracting probabilities is a separate process from the actual assignment of clusters. You can find more about that here.

Hi, thanks for the reply. I read through the details, posts, and code. My question is: when I run topics, probs = topic_model.fit_transform(text_input) with calculate_probabilities=True, and then do inference with clusters, pred_probs = topic_model.transform(text_input), shouldn't probs[0] equal pred_probs[0] for the sample text_input[0]? Under the hood, topic_model.fit_transform calls hdbscan.prediction.all_points_membership_vectors while topic_model.transform calls hdbscan.prediction.membership_vector when calculate_probabilities=True, but both functions should give the probabilities for the soft cluster membership. Why do they give different values for the same text (the first training sample)? Could you clarify a little more? I feel lost, since approximate_distribution is not used here at all.
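For reference, here is a minimal sketch of the two calls I am comparing, using toy blob data in place of my document embeddings (the data and parameters below are placeholders):

```python
import hdbscan
from sklearn.datasets import make_blobs

# Toy 2D data standing in for document embeddings (placeholder).
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# What fit_transform with calculate_probabilities=True uses on training data:
train_probs = hdbscan.prediction.all_points_membership_vectors(clusterer)

# What transform uses, treating the same point as incoming data:
new_probs = hdbscan.prediction.membership_vector(clusterer, X[:1])

print(train_probs[0])  # soft memberships from the training-time pass
print(new_probs[0])    # soft memberships from the inference-time pass
# In practice these two vectors do not match exactly.
```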

Thank you!

MaartenGr commented 1 year ago

No, all_points_membership_vectors works a bit differently from membership_vector and performs a slightly different calculation: the former operates on the training data already stored in the fitted model, while the latter treats its input as new, unseen data. For the specifics, you would have to dive into those functions yourself, or perhaps check the HDBSCAN issues page; I believe there are a number of issues discussing this phenomenon.

> Could you clarify a little more? I feel lost, since approximate_distribution is not used here at all.

The approximate_distribution function in BERTopic was created as an alternative to HDBSCAN's probability generation. I would advise using it, as the underlying method is a bit more straightforward.
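As a sketch, assuming an already fitted topic_model and a list of documents docs:

```python
# Assumes `topic_model` is a fitted BERTopic instance and
# `docs` is the list of documents to score.
topic_distr, _ = topic_model.approximate_distribution(docs)

# topic_distr[0] holds a distribution over topics for docs[0], computed by
# sliding token windows over the document and comparing them to the topics'
# c-TF-IDF representations, rather than via HDBSCAN's soft clustering.
print(topic_distr[0])
```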