jbehnk opened 8 months ago
Thanks for the suggestion! I agree, having something like that would be a nice experience. There are two concerns though. First, the underlying CountVectorizers might be different, so the recalculated representative documents might not match the ones that were computed initially. More specifically, you would now be using the default CountVectorizer rather than the vectorizer of either of the original models.
Aside from that, it might be worthwhile to simply have this feature in `get_representative_docs`, where you could indicate whether to recalculate them if necessary.
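For illustration, such an option might look something like this. Note that the `docs` and `recalculate` parameters are hypothetical and do not exist in the current API; this is only a sketch of the suggestion:

```python
# Hypothetical sketch -- the docs/recalculate parameters do not exist in the
# current get_representative_docs API; this only illustrates the suggestion.
repr_docs = merged_model.get_representative_docs(
    topic=5,
    docs=docs,          # documents to recompute from (hypothetical parameter)
    recalculate=True,   # recompute instead of returning stored docs (hypothetical)
)
```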
@MaartenGr @jbehnk
I am running into the same issue: because of limited computational resources, and because `UMAP` from `cuml.manifold` and `HDBSCAN` from `cuml.cluster` do not support `partial_fit`, I have to go with the `merge_models` strategy.
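A rough sketch of that setup (parameter values are placeholders, following the cuML examples in BERTopic's GPU documentation, and `docs` is assumed to be the full list of documents):

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

def make_model():
    # Fresh cuML estimators for each model; since they have no partial_fit,
    # every chunk of documents gets its own fully fitted model.
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    hdbscan_model = HDBSCAN(min_samples=10, prediction_data=True)
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

model_a = make_model().fit(docs[:5000])
model_b = make_model().fit(docs[5000:])
merged = BERTopic.merge_models([model_a, model_b])
```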
If we know for sure that the same vectorizer model was used while training both models being merged, why not simply do:

```python
merged_model_8.update_topics(docs, vectorizer_model=vectorizer_model)
```

since that recalculates both c-TF-IDF and the topic representations? Or am I missing something in my understanding of it?
@alihashaam That is indeed an option that should fix the problem, assuming you don't mind recalculating both c-TF-IDF and the topic representations. I have had users who only wanted to recalculate c-TF-IDF, for instance.
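For those users, something like the following sketch recomputes only the c-TF-IDF matrix. Note that it relies on the private `_c_tf_idf` method, so it may break between versions:

```python
import pandas as pd

# Rebuild the per-topic documents from the merged model's assignments and
# recompute only the c-TF-IDF matrix, leaving topic representations untouched.
# Caveat: _c_tf_idf is a private method and may change between versions.
documents = pd.DataFrame({"Document": docs, "Topic": merged_model.topics_})
documents_per_topic = documents.groupby("Topic", as_index=False).agg({"Document": " ".join})
merged_model.c_tf_idf_, words = merged_model._c_tf_idf(documents_per_topic)
```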
@MaartenGr Thank you for the helpful feedback! I don't see any harm in updating the topic representations: in my case the underlying data sources are the same, and it is only because of computational limits that I have to go with the `merge_models` strategy. So in this case, not updating the topic representations wouldn't actually be desirable, or at least that's what I think.
I have one more (possibly naive) question: when using `update_topics` or recalculating `c_tf_idf_`, should we pass in all the documents used to train both models being merged, or only the documents used to train the most recent model being merged (i.e., the one after which `update_topics` is called)?
For example, in the scenario below:
```python
topic_model_1 = BERTopic(min_topic_size=5).fit(docs[:4000])
topic_model_2 = BERTopic(min_topic_size=5).fit(docs[4000:8000])
topic_model_3 = BERTopic(min_topic_size=5).fit(docs[8000:])

merged_model = BERTopic.merge_models([topic_model_1, topic_model_2], min_similarity=0.9)
merged_model.update_topics(docs[4000:8000], vectorizer_model=self.vectorizer_model)  # or should it be docs[:8000]?
...
merged_model_2 = BERTopic.merge_models([merged_model, topic_model_3], min_similarity=0.9)
merged_model_2.update_topics(docs[8000:], vectorizer_model=self.vectorizer_model)  # or should it be docs?
```
The question stems from the fact that, if you look at https://github.com/MaartenGr/BERTopic/blob/eba1d3443e81aa6cd1b3ef41048ee612c4df1230/bertopic/_bertopic.py#L1553, in `update_topics` the `c_tf_idf_` is recalculated without fitting on the new data again; eventually only `transform()` is called.
Thank you
> I have one more (possibly naive) question: when using `update_topics` or recalculating `c_tf_idf_`, should we pass in all the documents used to train both models being merged, or only the documents used to train the most recent model being merged (i.e., the one after which `update_topics` is called)?
You should use all documents from both models. The reason is that almost all information from both models is kept, including the predictions for each document from both models.
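Applied to your scenario above, that would look roughly like this (a sketch, assuming the same `vectorizer_model` was used for all models):

```python
# Pass the documents of *both* merged models, not only the newest chunk.
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2], min_similarity=0.9)
merged_model.update_topics(docs[:8000], vectorizer_model=vectorizer_model)

merged_model_2 = BERTopic.merge_models([merged_model, topic_model_3], min_similarity=0.9)
merged_model_2.update_topics(docs, vectorizer_model=vectorizer_model)
```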
Hello! I have been using merged models to avoid RAM limitations. After merging my models into a new model, I found that there are no representative documents in `model.get_topic_info()`, and `model.hierarchical_topics(docs)` does not work either. I solved the problem with the help of:
It would be good to have a function/option that does this automatically.
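Something along these lines, perhaps, where `merge_and_refresh` is just a hypothetical helper name chaining the workaround discussed in this thread:

```python
def merge_and_refresh(models, docs, vectorizer_model, min_similarity=0.9):
    # Hypothetical helper (not part of BERTopic): merge the models, then
    # recompute c-TF-IDF and topic representations over all documents, as
    # suggested earlier in this thread.
    merged = BERTopic.merge_models(models, min_similarity=min_similarity)
    merged.update_topics(docs, vectorizer_model=vectorizer_model)
    return merged
```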