MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.08k stars 757 forks

Outliers in test dataset after Outlier Reduction Technique #1129

Closed mohit-monpara closed 1 year ago

mohit-monpara commented 1 year ago

Thanks for creating this awesome repository. I'm currently experiencing issues with model prediction and was hoping you could offer some guidance. (BERTopic version: v0.13.0)

To provide some context, I trained the model on a dataset and used an outlier reduction technique to update the model. I then exported the updated model and used it for prediction. Unfortunately, I still get outliers (-1) in the predictions: approximately 35% of the test dataset is returned as outliers.

I was wondering if you might be able to suggest some solutions or best practices to achieve better results after applying outlier reduction techniques. Any advice you could offer would be greatly appreciated.

MaartenGr commented 1 year ago

When you run .reduce_outliers, it maps the outlier topics of the documents you pass in to non-outlier topics. However, the underlying cluster model itself is not updated by this function, so predictions on unseen documents can still come back as -1. In other words, you would have to apply the same outlier reduction strategy to the predictions for unseen documents.
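A minimal sketch of that two-step prediction, assuming a fitted BERTopic model (v0.13 or later) that exposes .transform and .reduce_outliers. The helper name predict_without_outliers is hypothetical and not part of BERTopic, and the strategy should be whichever one was used after training:

```python
from typing import List, Sequence, Tuple


def predict_without_outliers(
    topic_model,
    docs: Sequence[str],
    strategy: str = "distributions",
) -> Tuple[List[int], List[int]]:
    """Predict topics for unseen docs, then reassign any -1 outliers.

    `topic_model` is assumed to be a fitted BERTopic instance; this helper
    is a hypothetical wrapper, not a BERTopic function.
    """
    docs = list(docs)

    # Step 1: regular prediction. The cluster model was not changed by the
    # earlier outlier reduction, so it can still assign -1 here.
    raw_topics, _ = topic_model.transform(docs)
    raw_topics = [int(t) for t in raw_topics]

    # Step 2: apply the outlier reduction strategy to these predictions.
    if -1 in raw_topics:
        reduced = topic_model.reduce_outliers(docs, raw_topics, strategy=strategy)
        reduced_topics = [int(t) for t in reduced]
    else:
        reduced_topics = list(raw_topics)

    return raw_topics, reduced_topics
```

Calling it with the exported model and the test documents returns both the raw and the reduced topic assignments, so the share of -1 before and after can be compared. Depending on the strategy and its threshold, a few documents may still remain at -1.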

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open the issue!