MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Reducing Outliers of Loaded Model #1944

Open alegallo1511 opened 7 months ago

alegallo1511 commented 7 months ago

Hi!

A month ago I created a topic model and saved it as follows: topic_model.save(outpath, serialization="safetensors").

I then reduced the outliers in the model, new_topics = topic_model.reduce_outliers(docs, topics), and used it in an empirical analysis, but I did not save the model with the updated topics.

I now want to produce visualizations of the topics used in the analysis, so I have loaded my dataframe (and defined docs again), loaded the model, and tried to reduce the outliers again, but I get an error that I am not sure how to fix. The code and error are below:

loaded_model = BERTopic.load("Only-English-BERT-topic-meaning-min-size-50")
topics = loaded_model.topics_
new_topics = loaded_model.reduce_outliers(docs, topics)

sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided

I have also tried using `topics, probs = loaded_model.transform(docs)`, but I got the same error.

Any help in how to fix this would be greatly appreciated.

Thanks in advance for your time!

MaartenGr commented 7 months ago

Which version of BERTopic are you using? Was it the same as when you saved the model?

Also, could you provide the full error message? It's not clear to me what the error is referencing.

NullPxl commented 2 days ago

@MaartenGr I'm getting the same error in a very similar scenario on 0.16.4. Loading using the same version as I saved the model with.

# This all works fine
loaded_models = [BERTopic.load(model_dir, embedding_model=embedding_model) for model_dir in models]
topic_model = BERTopic.merge_models(loaded_models, embedding_model=embedding_model, min_similarity=MIN_TOPIC_SIMILARITY)
print(topic_model.get_topic_info().head(25))
print('merged all models ', time.time())

# Same models as I used when saving the original topic models
vectorizer_model = CountVectorizer(ngram_range=(1, 3), min_df=10)
key_mmr = [KeyBERTInspired(top_n_words=10, random_state=42), MaximalMarginalRelevance(diversity=0.5)] # chain
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = {
"mmr":  mmr_model,
"keymmr": key_mmr
}

# Error occurs here
new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)

Here's the full error message:

  0%|          | 0/1171 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/redacted/merge_monthly_bertopic.py", line 85, in <module>
    new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/redacted/env/lib/python3.11/site-packages/bertopic/_bertopic.py", line 2330, in reduce_outliers
    topic_distr, _ = self.approximate_distribution(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/redacted/env/lib/python3.11/site-packages/bertopic/_bertopic.py", line 1341, in approximate_distribution
    bow_doc = self.vectorizer_model.transform(all_sentences)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/redacted/env/lib/python3.11/site-packages/sklearn/feature_extraction/text.py", line 1414, in transform
    self._check_vocabulary()
  File "/redacted/env/lib/python3.11/site-packages/sklearn/feature_extraction/text.py", line 505, in _check_vocabulary
    raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided

MaartenGr commented 1 day ago

@NullPxl When you merge models, their c-TF-IDF representations are currently not merged, since they have different distributions (and potentially different vocabularies). Therefore, when you reduce outliers, the model cannot make use of c-TF-IDF. You will instead either have to recalculate the c-TF-IDF representations with .update_topics, or use embeddings when reducing outliers (see its parameters).
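
Roughly, a sketch of both options (untested, and assuming the merged topic_model, docs, vectorizer_model, and representation_model from your snippet above):

# Option 1: refit c-TF-IDF on the merged model so the vectorizer has a vocabulary,
# then reduce outliers with the default strategy as before.
topic_model.update_topics(docs, vectorizer_model=vectorizer_model, representation_model=representation_model)
new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)

# Option 2: skip c-TF-IDF entirely and assign outliers by embedding similarity;
# this only needs the embedding_model you passed when loading/merging.
new_topics = topic_model.reduce_outliers(docs, topic_model.topics_, strategy="embeddings")

# In both cases, update the topic representations with the new assignments afterwards.
topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model)

The default reduce_outliers strategy goes through approximate_distribution, which is where the unfitted vectorizer raises the NotFittedError in your traceback; strategy="embeddings" avoids that code path.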