How to do Hierarchical Topic Modeling on Merged Model?

shivamtawari commented 2 months ago

Have you searched existing issues? 🔎

[X] I have searched and found no existing issues

Describe the bug

Hi @MaartenGr, I am trying to create a hierarchical topic visualization for two topic models that were merged using .merge_models.

hierarchical_topics_merged = merged_model.hierarchical_topics(docs_1+docs_2)

It produces the following error:

2024-09-16 09:47:47,878 - BERTopic - WARNING: No c-TF-IDF matrix was found despite it is supposed to be used (`use_ctfidf` is True). Defaulting to semantic embeddings.
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-24-5238a0008058> in <cell line: 1>()
----> 1 hierarchical_topics_merged = merged_model.hierarchical_topics(docs_3)

2 frames
/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in _check_vocabulary(self)
    506             self._validate_vocabulary()
    507             if not self.fixed_vocabulary_:
--> 508                 raise NotFittedError("Vocabulary not fitted or provided")
    509 
    510         if len(self.vocabulary_) == 0:

NotFittedError: Vocabulary not fitted or provided

How do I visualize merged models?

Thanks!

BERTopic Version

v0.16.3

MaartenGr commented 2 months ago

I'm missing the full error log (the "2 frames" that are collapsed in your traceback). Without it, I can't say exactly what the problem is. Having said that, you can use use_ctfidf=False to solve your problem.

shivamtawari commented 2 months ago

Hi, I forgot to include the complete error log. Here it is:

NotFittedError                            Traceback (most recent call last)
<ipython-input-24-5238a0008058> in <cell line: 1>()
----> 1 hierarchical_topics_merged = merged_model.hierarchical_topics(docs_3)

2 frames
/usr/local/lib/python3.10/dist-packages/bertopic/_bertopic.py in hierarchical_topics(self, docs, use_ctfidf, linkage_function, distance_function)
   1101         # and will be removed in 1.2. Please use get_feature_names_out instead.
   1102         if version.parse(sklearn_version) >= version.parse("1.0.0"):
-> 1103             words = self.vectorizer_model.get_feature_names_out()
   1104         else:
   1105             words = self.vectorizer_model.get_feature_names()

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in get_feature_names_out(self, input_features)
   1483             Transformed feature names.
   1484         """
-> 1485         self._check_vocabulary()
   1486         return np.asarray(
   1487             [t for t, i in sorted(self.vocabulary_.items(), key=itemgetter(1))],

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py in _check_vocabulary(self)
    506             self._validate_vocabulary()
    507             if not self.fixed_vocabulary_:
--> 508                 raise NotFittedError("Vocabulary not fitted or provided")
    509 
    510         if len(self.vocabulary_) == 0:

NotFittedError: Vocabulary not fitted or provided

MaartenGr commented 2 months ago

Ah, it seems that it truly needs a fitted vectorizer in order to run this model. Hmmm, the only thing that could solve this is running .update_topics with the documents of both models to recreate a vectorizer model before doing the hierarchical topic modeling.
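
A minimal sketch of that suggestion, not a confirmed fix: it calls .update_topics on the merged model with the combined documents so that a vectorizer gets fitted, and only then calls .hierarchical_topics. The helper name hierarchy_for_merged_model is hypothetical and introduced here for illustration. Whether the merged model's stored topic assignments line up with the combined documents is not confirmed in this thread, so the optional topics argument lets you pass assignments explicitly (for example, the first value returned by merged_model.transform(docs)).

```python
def hierarchy_for_merged_model(merged_model, docs, topics=None, vectorizer_model=None):
    """Refit the vectorizer via update_topics, then build the topic hierarchy.

    Hypothetical helper (not part of BERTopic). `merged_model` is expected to be
    the result of BERTopic.merge_models([...]); `docs` are the documents of all
    merged models combined, in a single list.
    """
    # update_topics recomputes the topic representations and, in doing so,
    # fits a vectorizer on `docs`. If `topics` is None, BERTopic falls back to
    # the assignments stored on the model (assumption: they match `docs`).
    merged_model.update_topics(docs, topics=topics, vectorizer_model=vectorizer_model)

    # With a fitted vectorizer in place, the hierarchy can be computed.
    return merged_model.hierarchical_topics(docs)


# Usage sketch (requires bertopic; model_1, model_2, docs_1 and docs_2 are
# assumed to come from your own session):
#
# from bertopic import BERTopic
# merged_model = BERTopic.merge_models([model_1, model_2])
# docs = docs_1 + docs_2
# hierarchical_topics_merged = hierarchy_for_merged_model(merged_model, docs)
# merged_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics_merged)
```

Keeping the two calls in one helper makes the ordering explicit: the hierarchy step reads the vectorizer's vocabulary, so the refit has to happen first.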
