MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

option to recalculate c_tf_idf_, topic_representations_ and representative_docs_ after merging models #1878

Open jbehnk opened 8 months ago

jbehnk commented 8 months ago

Hello! I have been using merged models to avoid RAM limitations. After merging my models into a new model, I found that model.get_topic_info() returns no representative documents and that model.hierarchical_topics(docs) does not work either. I worked around the problem with:

import pandas as pd

# Build a dataframe with the columns BERTopic uses internally:
# ["Document", "Topic", "ID", "Image"]
docs = data["text_caption"].values
topics = data["topic"].values
ids = range(len(docs))
images = None

documents = pd.DataFrame({"Document": docs, "Topic": topics, "ID": ids, "Image": images})
documents_per_topic = documents.groupby(["Topic"], as_index=False).agg({"Document": " ".join})

# Recompute the c-TF-IDF matrix, the topic representations, and the
# representative documents using BERTopic's private helpers
merged_model_8.c_tf_idf_, words = merged_model_8._c_tf_idf(documents_per_topic)
merged_model_8.topic_representations_ = merged_model_8._extract_words_per_topic(
    words, documents, merged_model_8.c_tf_idf_, calculate_aspects=False
)
merged_model_8._save_representative_docs(documents)

It would be good to have a function/option that does this automatically.
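The steps above could be wrapped in a small reusable helper. A sketch (the function name is hypothetical; it relies on the same private BERTopic methods as the snippet, so it may break across library versions):

```python
import pandas as pd


def recalculate_topic_info(topic_model, docs, topics):
    """Hypothetical helper: rebuild c_tf_idf_, topic_representations_
    and representative_docs_ on a merged BERTopic model."""
    documents = pd.DataFrame({
        "Document": docs,
        "Topic": topics,
        "ID": range(len(docs)),
        "Image": None,
    })
    documents_per_topic = documents.groupby("Topic", as_index=False).agg({"Document": " ".join})

    # Same private helpers as in the snippet above
    topic_model.c_tf_idf_, words = topic_model._c_tf_idf(documents_per_topic)
    topic_model.topic_representations_ = topic_model._extract_words_per_topic(
        words, documents, topic_model.c_tf_idf_, calculate_aspects=False
    )
    topic_model._save_representative_docs(documents)
```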

MaartenGr commented 8 months ago

Thanks for the suggestion! I agree, having something like this would be a nice experience. There is a concern though: the underlying CountVectorizers of the merged models might differ, so the recalculated representative documents might not match what they were initially. More specifically, you would now be using the default CountVectorizer rather than either of the original ones.

Aside from that, it might be worthwhile to simply add this as an option to .get_representative_docs, where you can indicate that they should be recalculated if necessary.

alihashaam commented 1 month ago

@MaartenGr @jbehnk

I am running into the same issue: because of limited computational resources, and because UMAP from cuml.manifold and HDBSCAN from cuml.cluster do not support partial_fit, I have to go with the merge_models strategy.

If we know for sure that the same vectorizer model was used while training both of the models being merged, why not do:

merged_model_8.update_topics(docs, vectorizer_model=vectorizer_model)

Because that recalculates both c_tf_idf_ and topic_representations_.

Or am I missing something in my understanding of it?

MaartenGr commented 1 month ago

@alihashaam That is indeed an option that should fix the problem, assuming you don't mind recalculating both c-TF-IDF and the topic representations. I have had users who only wanted to recalculate c-TF-IDF, for instance.

alihashaam commented 1 month ago

@MaartenGr Thank you for the helpful feedback! I don't see any harm in updating topic_representations_: in my case the underlying data sources are the same, and it is only because of the computational limit that I have to use the merge_models strategy. So in this case, not updating topic_representations_ would actually be undesirable, or at least that's what I think.

I have one more (possibly naive) question: when using update_topics or recalculating c_tf_idf_, should we pass in all the documents used to train both models being merged, or only the documents used to train the most recent model being merged (i.e., the one after which update_topics is called)? For example, in the scenario below:

topic_model_1 = BERTopic(min_topic_size=5).fit(docs[:4000])
topic_model_2 = BERTopic(min_topic_size=5).fit(docs[4000:8000])
topic_model_3 = BERTopic(min_topic_size=5).fit(docs[8000:])

merged_model = BERTopic.merge_models([topic_model_1, topic_model_2], similarity=0.9)
merged_model.update_topics(docs[4000:8000], vectorizer_model=self.vectorizer_model) # or should it be docs[:8000]
....
merged_model_2 = BERTopic.merge_models([merged_model, topic_model_3], similarity=0.9)
merged_model_2.update_topics(docs[8000:], vectorizer_model=self.vectorizer_model) # or should it be docs

The question stems from the fact that, if you look at https://github.com/MaartenGr/BERTopic/blob/eba1d3443e81aa6cd1b3ef41048ee612c4df1230/bertopic/_bertopic.py#L1553 in update_topics, c_tf_idf_ is calculated without fitting on the new data again; eventually only transform() is called.
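To illustrate why that fit/transform distinction matters (a toy sklearn example, not BERTopic itself): a vectorizer fitted on one set of documents has a frozen vocabulary, so transform() on new documents silently drops any words it has never seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the "original" documents only
vectorizer = CountVectorizer()
vectorizer.fit(["apple banana", "banana cherry"])

# "durian" was not in the fitted vocabulary, so it contributes nothing
counts = vectorizer.transform(["durian apple"])
print(sorted(vectorizer.vocabulary_))  # ['apple', 'banana', 'cherry']
print(counts.toarray())                # [[1 0 0]] -- only 'apple' is counted
```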

Thank you

MaartenGr commented 1 month ago

I have one more (possibly naive) question: When using update_topics or recalculating c_tf_idf_, should we pass in all the documents used to train both models being merged, or only the documents used to train the most recent model that is being merged (i.e., the one after which update_topics is called)?

You should use all documents from both models. The reason is that almost all information from both models is kept, including the predictions for each document from both models.
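Applied to the scenario above, that advice would look as follows (same hypothetical variable names as in the earlier snippet; not a runnable example on its own):

```python
# Pass the documents from *both* merged models, not just the most recent one
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2], similarity=0.9)
merged_model.update_topics(docs[:8000], vectorizer_model=vectorizer_model)

merged_model_2 = BERTopic.merge_models([merged_model, topic_model_3], similarity=0.9)
merged_model_2.update_topics(docs, vectorizer_model=vectorizer_model)
```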