MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.17k stars, 764 forks

How to productionize it to run on a saved model? #1571

Open sdave-connexion opened 1 year ago

sdave-connexion commented 1 year ago

Hello, I have created my model, but I don't want to rerun the whole analysis every week. I would like to take only the new feedback that I receive and run it through the model, so that it gets assigned directly to my current clusters.

I looked at online topic modelling but it doesn't work after the .fit part.

Thank you in advance

Best, Shantanu

clstaudt commented 1 year ago

By "run the analysis" you mean calling the fit method? By "run it through the model" you mean calling the transform method? Unless I misunderstand something, isn't that an obvious solution right there?

sdave-connexion commented 1 year ago

@clstaudt Hi, I've used topic modeling to categorize a dataset of 100,000 feedbacks into 20 main topics. As new data arrives—around 5,000 feedbacks each week—I want to efficiently categorize this fresh feedback into those pre-established 20 topics. I'm looking for an approach where I can apply the already trained model to these new entries without having to re-run the model on the entire, ever-growing dataset.

What's the best way to achieve this incremental categorization?

For example, what I'd like to do is more or less akin to how, in supervised learning, we can use a trained model to predict labels for new, unseen data.

clstaudt commented 1 year ago

I believe BERTopic works out of the box just like the other ML models that you are familiar with:

  1. train the BERTopic model on your training set (the 100k feedbacks) via a call to fit and validate the topics
  2. save the model
  3. load the model whenever you need to label a new batch of feedbacks
  4. do so by calling the model's transform method on the new batch -> the result is an assignment of each new feedback to one (or several) of the pre-established topics (or the outlier topic)

MaartenGr commented 1 year ago

Indeed! After creating your model with fit_transform or fit, you can simply run transform on the new feedback to assign it to your existing topics.
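A minimal sketch of the fit, save, load, transform workflow described in this thread could look like the following. The helper names (`train_and_save`, `load_model`, `assign_topics`) and the model path are hypothetical and only illustrate the idea; the BERTopic calls used are `fit_transform`, `save`, `load`, and `transform`. It assumes the default serialization is acceptable for your setup.

```python
from typing import List, Sequence, Tuple


def train_and_save(docs: Sequence[str], model_path: str) -> None:
    """Run once: fit on the full training set and persist the model."""
    # Imported lazily so the rest of this module works without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic()
    topic_model.fit_transform(list(docs))
    # Uses BERTopic's default serialization; other options are described in its docs.
    topic_model.save(model_path)


def load_model(model_path: str):
    """Run every week: load the previously saved model instead of refitting."""
    from bertopic import BERTopic

    return BERTopic.load(model_path)


def assign_topics(topic_model, new_docs: Sequence[str]) -> List[Tuple[str, int]]:
    """Label a new batch with the existing topics; nothing is refitted here."""
    topics, _probs = topic_model.transform(list(new_docs))
    # Topic -1 is BERTopic's outlier topic.
    return [(doc, int(topic)) for doc, topic in zip(new_docs, topics)]


# Weekly usage (the path and variable names are placeholders):
# model = load_model("feedback_topic_model")
# for doc, topic in assign_topics(model, this_weeks_feedback):
#     print(topic, doc)
```

Keeping the labelling step in a function that only relies on `transform` means the weekly job never touches the original 100k documents; only the saved model and the new batch are needed.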