Merging topic models - Githubissues

guymorlan commented 10 months ago

Hi,

Thank you so much for this amazing package. This is a question and not an issue, I hope it's appropriate to post here.

In my use case, I re-train BERTopic daily to capture new topics that arise continuously. I tried online learning methods using the River library but the results were significantly worse than one-off training with UMAP and HDBSCAN, so I'm re-training for now. Naturally some of the topics that arise in a given day are similar or equivalent to topics from yesterday.

I have two questions regarding this. First, are there recommended ways to identify that a topic is similar to a topic from a previous topic model (using c-TF-IDF distribution, vector for centroid, etc)? And second, having identified similarities in topics between two topic models, is it possible to merge the models to get one large topic model incorporating both?

Thanks!

MaartenGr commented 10 months ago

First, are there recommended ways to identify that a topic is similar to a topic from a previous topic model (using c-TF-IDF distribution, vector for centroid, etc)?

Generally, it would indeed either be cosine similarity on the c-TF-IDF (topic_model.c_tf_idf_) or embedding (topic_model.topic_embeddings_) representations.
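To make the comparison concrete, here is a minimal sketch of matching topics across two models by cosine similarity. The function names (`cosine`, `match_topics`), the toy vectors, and the 0.8 threshold are illustrative assumptions, not part of BERTopic. In practice the rows would come from something like `topic_model.topic_embeddings_` of each model, which assumes both models used the same embedding model. Separately trained models generally have different vocabularies, so their `c_tf_idf_` matrices would need to be brought onto a shared vocabulary before they can be compared this way.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_topics(old_vectors, new_vectors, threshold=0.8):
    """For each new topic row, return the index of the most similar old topic row,
    or None when no old topic reaches the (arbitrary) threshold."""
    matches = {}
    for j, new_vec in enumerate(new_vectors):
        sims = [cosine(old_vec, new_vec) for old_vec in old_vectors]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches[j] = best if sims[best] >= threshold else None
    return matches

# Toy stand-ins for yesterday's and today's topic vectors (hypothetical data)
old = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
new = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

matches = match_topics(old, new)
print(matches)  # {0: 0, 1: None}
```

Note that the keys and values here are row indices, which may not equal BERTopic topic ids (for example when an outlier topic is present), so a mapping back to topic ids may be needed.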

And second, having identified similarities in topics between two topic models, is it possible to merge the models to get one large topic model incorporating both?

Yes, that should be possible if you focus on merely labeling the resulting topics and documents. For instance, let's assume that you have two topic models and a set of documents for each. After training, you would have identified which topics match across the models and which are new. Each document would then be labeled with one of those merged or unmerged topics. Then, you can pass the documents and their assigned topics to a new model using manual topic modeling. This allows you to essentially merge two topic models. It will create new representations though, since there are more documents to build the representations from.
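The labeling step described above could be sketched roughly as follows. The helper `merge_labels` and its arguments are hypothetical: `matches` is assumed to be a dict from a topic id in the second model to the equivalent topic id in the first, produced by whatever similarity check you use, and `-1` is assumed to mark outliers as in BERTopic's default output.

```python
def merge_labels(topics_a, topics_b, matches, n_topics_a):
    """Put both models' per-document topic ids into one shared id space.

    topics_a, topics_b: per-document topic ids from model A and model B.
    matches: {topic_id_in_b: topic_id_in_a} for topics judged equivalent.
    n_topics_a: number of non-outlier topics in model A (ids 0..n_topics_a-1).
    """
    mapping = dict(matches)
    next_id = n_topics_a
    for topic in sorted(set(topics_b)):
        if topic == -1 or topic in mapping:
            continue
        # Unmatched topic from model B: give it a fresh id after A's topics
        mapping[topic] = next_id
        next_id += 1
    mapping[-1] = -1  # keep outliers as outliers
    return list(topics_a) + [mapping[t] for t in topics_b]

# Hypothetical example: B's topic 0 was found equivalent to A's topic 1
docs_a = ["doc a1", "doc a2", "doc a3", "doc a4"]
docs_b = ["doc b1", "doc b2", "doc b3", "doc b4", "doc b5"]
topics_a = [0, 1, 1, -1]
topics_b = [0, 0, 1, -1, 2]

docs = docs_a + docs_b
y = merge_labels(topics_a, topics_b, matches={0: 1}, n_topics_a=2)
print(y)  # [0, 1, 1, -1, 1, 1, 2, -1, 3]
```

The combined `docs` and `y` are what would then be handed to a new model set up for manual topic modeling (in the BERTopic documentation this is done with empty embedding, dimensionality reduction, and clustering models and `fit_transform(docs, y=y)`). How outlier labels should be treated in that step is left open here.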

dereke55 commented 8 months ago

I am doing something similar and also have found that the results of BERTopic are "better" than those of River. That being said, can either of you expand on the "cosine similarity on the c-TF-IDF (topic_model.c_tf_idf_) or embedding (topic_model.topic_embeddings_) representations"?

Thank you in advance!

MaartenGr commented 8 months ago

@dereke55 You can calculate the c-TF-IDF representations for both the documents and topics and compare them with cosine similarity. That way, you can quickly view which documents belong to which topics and the extent to which they are similar. The same can be done with embedding both the documents and topics.

Do note though that in the main branch there is a new feature for merging topic models as described extensively here: https://github.com/MaartenGr/BERTopic/pull/1516

MaartenGr / BERTopic

Merging topic models #1471