Open alegallo1511 opened 7 months ago
Which version of BERTopic are you using? Was it the same as when you saved the model?
Also, could you provide the full error message? It's not clear to me what the error is referencing.
@MaartenGr I'm getting the same error in a very similar scenario on 0.16.4. I'm loading with the same version I used to save the model.
# This all works fine
loaded_models = [BERTopic.load(model_dir, embedding_model=embedding_model) for model_dir in models]
topic_model = BERTopic.merge_models(loaded_models, embedding_model=embedding_model, min_similarity=MIN_TOPIC_SIMILARITY)
print(topic_model.get_topic_info().head(25))
print('merged all models ', time.time())
# Same models as I used to save the original topic models
vectorizer_model = CountVectorizer(ngram_range=(1, 3), min_df=10)
key_mmr = [KeyBERTInspired(top_n_words=10, random_state=42), MaximalMarginalRelevance(diversity=0.5)]  # chained representations
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = {
    "mmr": mmr_model,
    "keymmr": key_mmr,
}
# Error occurs here
new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)
Here's the full error message:
0%| | 0/1171 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/redacted/merge_monthly_bertopic.py", line 85, in <module>
new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/redacted/env/lib/python3.11/site-packages/bertopic/_bertopic.py", line 2330, in reduce_outliers
topic_distr, _ = self.approximate_distribution(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/redacted/env/lib/python3.11/site-packages/bertopic/_bertopic.py", line 1341, in approximate_distribution
bow_doc = self.vectorizer_model.transform(all_sentences)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/redacted/env/lib/python3.11/site-packages/sklearn/feature_extraction/text.py", line 1414, in transform
self._check_vocabulary()
File "/redacted/env/lib/python3.11/site-packages/sklearn/feature_extraction/text.py", line 505, in _check_vocabulary
raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided
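The root cause of this traceback can be reproduced outside BERTopic: scikit-learn's CountVectorizer raises exactly this NotFittedError when transform is called before the vocabulary has been fitted, which is the state the merged model's vectorizer is in. A minimal sketch (the example documents are placeholders):

```python
# Minimal reproduction of the underlying error: calling transform() on a
# CountVectorizer that was never fitted raises NotFittedError, which is
# what happens when reduce_outliers hits the merged model's unfitted
# vectorizer. Fitting (or providing a vocabulary) resolves it.
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "topic modeling with bertopic",
    "outlier reduction needs a fitted vocabulary",
]

vectorizer = CountVectorizer()
try:
    vectorizer.transform(docs)  # vocabulary was never fitted
except NotFittedError as e:
    print(f"NotFittedError: {e}")

vectorizer.fit(docs)             # fitting builds the vocabulary
bow = vectorizer.transform(docs)
print(bow.shape[0])              # one row per document
```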
@NullPxl When you merge models, their c-TF-IDF representations are currently not merged, since the models have different distributions (and potentially different vocabularies). As a result, reduce_outliers cannot make use of c-TF-IDF on a merged model. You will instead either have to recalculate the c-TF-IDF representations with .update_topics, or use embeddings when reducing outliers (see its parameters).
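Both workarounds can be sketched against the objects defined in the snippet earlier in this thread (topic_model, docs, vectorizer_model, and representation_model are assumed to exist as above, so this fragment is not runnable on its own):

```python
# Sketch of the two suggested workarounds; assumes topic_model, docs,
# vectorizer_model, and representation_model exist as defined earlier.

# Option 1: refit c-TF-IDF (and the vectorizer vocabulary) on the merged
# model, then reduce outliers with the default "distributions" strategy.
topic_model.update_topics(
    docs,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
)
new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)

# Option 2: skip c-TF-IDF entirely and assign outliers by embedding
# similarity instead (see the `strategy` parameter of reduce_outliers).
new_topics = topic_model.reduce_outliers(
    docs, topic_model.topics_, strategy="embeddings"
)
```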
Hi!
A month ago I created a topic model and saved it as follows:
topic_model.save(outpath, serialization="safetensors")
I then reduced the outliers in the model with
new_topics = topic_model.reduce_outliers(docs, topics)
, and used it in an empirical analysis, but I did not save the model with the updated topics. I now want to produce visualizations of the topics used in the analysis, so I have loaded my dataframe (and defined docs again), loaded the model, and tried to reduce the outliers again, but I get an error and am not sure how to fix it. The code and error are below:
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided
I have also tried using
, but I got the same error.
Any help in how to fix this would be greatly appreciated.
Thanks in advance for your time!