MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.19k stars, 765 forks

'Merge_Models' with new topic_model from outliers #2222

Open OnAnd0n opened 3 days ago

OnAnd0n commented 3 days ago

I would like to use merge_models in BERTopic to re-cluster the outliers with HDBSCAN and merge the resulting topics with the existing ones.

However, I am currently running into some problems with merge_models:

  1. When merging Topic_model (fitted on all data, including outliers) and Out_Topic_model (fitted only on the outliers), the 'Count' of topic -1 in Topic_model increases by the number of outliers, instead of the two models being effectively concatenated.

  2. The Representative_docs are displayed as NaN. Is this the only way?
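
For reference, the workflow described above can be sketched roughly as follows. This is a minimal sketch and not the original poster's code: the helper names `split_outliers` and `recluster_outliers_and_merge` are hypothetical, the models use default settings, and the `min_similarity` value is only an illustrative choice. Fitting a second model on a small set of outliers may fail or produce few topics, so treat it as a starting point for a reproducible example.

```python
from typing import List, Sequence, Tuple


def split_outliers(docs: Sequence[str], topics: Sequence[int]) -> Tuple[List[str], List[str]]:
    """Split documents into (assigned, outliers) using the -1 outlier label."""
    assigned = [doc for doc, topic in zip(docs, topics) if topic != -1]
    outliers = [doc for doc, topic in zip(docs, topics) if topic == -1]
    return assigned, outliers


def recluster_outliers_and_merge(docs: Sequence[str], min_similarity: float = 0.7):
    """Fit on all docs, fit a second model on the outliers only, then merge both.

    Hypothetical helper for illustration; requires `pip install bertopic`.
    """
    from bertopic import BERTopic  # imported lazily so the helper above works without it

    # Model fitted on all data; some documents end up in the outlier topic -1.
    topic_model = BERTopic()
    topics, _ = topic_model.fit_transform(list(docs))

    # Second model fitted only on the documents labelled -1 by the first model.
    _, outlier_docs = split_outliers(docs, topics)
    out_topic_model = BERTopic()
    out_topic_model.fit(outlier_docs)

    # Merge the two fitted models; min_similarity here is an assumed example value.
    return BERTopic.merge_models([topic_model, out_topic_model], min_similarity=min_similarity)
```

Calling `recluster_outliers_and_merge(docs).get_topic_info()` on a real corpus would then show the 'Count' and Representative_Docs columns that the two points above refer to.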

My BERTopic version is 0.16.3.

How can these issues be resolved?

MaartenGr commented 12 hours ago

> When merging Topic_model (fitted on all data, including outliers) and Out_Topic_model (fitted only on the outliers), the 'Count' of topic -1 in Topic_model increases by the number of outliers, instead of the two models being effectively concatenated.

I have a hard time understanding what exactly you mean here. Could you give an example? Perhaps show what is happening and what you would expect to happen?

> The Representative_docs are displayed as NaN. Is this the only way?

The representative documents are indeed displayed as NaN, since merge_models is also meant for federated learning, where the original documents may not be available when the models are merged. If you want the representative documents re-calculated, I would advise checking the issues page. I believe a number of issues there describe in detail how you can do this.
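
As a rough illustration of what re-calculating representative documents could look like, the sketch below picks, for each topic, the documents whose embeddings are closest to that topic's embedding. This is an assumption-laden stand-in: it uses plain cosine similarity, it is neither BERTopic's own selection procedure nor the approach from the issues mentioned above, and the function name `pick_representative_docs` and its inputs are hypothetical. In practice the topic assignments and embeddings would have to come from the merged model and the embedding model it was built with.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_representative_docs(
    docs: Sequence[str],
    doc_embeddings: Sequence[Sequence[float]],
    topics: Sequence[int],
    topic_embeddings: Dict[int, Sequence[float]],
    nr_docs: int = 3,
) -> Dict[int, List[str]]:
    """For each topic, return its nr_docs members most similar to the topic embedding.

    Hypothetical helper: all inputs are assumed to be prepared by the caller.
    """
    representative: Dict[int, List[str]] = {}
    for topic, topic_vec in topic_embeddings.items():
        # Score only the documents assigned to this topic.
        scored = [
            (cosine(vec, topic_vec), doc)
            for doc, vec, assigned in zip(docs, doc_embeddings, topics)
            if assigned == topic
        ]
        # Highest similarity first; ties keep their original document order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        representative[topic] = [doc for _, doc in scored[:nr_docs]]
    return representative
```

The resulting dictionary maps each topic ID to a short list of documents. Whether and how it can be attached to the merged model so that it shows up in the topic overview depends on the BERTopic version and should be checked against the documentation.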