MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

reduce_outliers() crashes RAM in Google Colab #1286

Closed eneimdrop closed 1 year ago

eneimdrop commented 1 year ago

Hello Maarten! First of all, thank you very much for this package, your work, and your quick and helpful answers to the issues. I have been running BERTopic on a dataset of 100,000 French news articles for a topic modelling study during my internship. Unfortunately, I have run into quite a few crashes of the Colab environment along the way. In particular, I trained a model on the dataset with the default parameters (except for language='multilingual') and everything runs fine. But when I try to reduce the outliers using the reduce_outliers() function, the RAM usage shoots up and crashes the environment. It seems to happen when I use more than 20,000 documents to train the model, since it works perfectly well below that threshold. Do you have any idea why this happens? Thanks a lot in advance!

MaartenGr commented 1 year ago

The default strategy for reducing outliers uses topic distributions calculated with c-TF-IDF. This can become computationally expensive if you pass that many documents at once. Instead, it might be worthwhile to pass the documents in batches of ~1,000.
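A minimal sketch of what that batching could look like, assuming a fitted `topic_model` and the original `docs` and `topics` from training (the batch size of 1,000 is illustrative and can be tuned down if RAM is still an issue):

```python
# Reduce outliers in batches instead of passing all documents at once.
# `docs` and `topics` are assumed to be the documents and topic
# assignments used/produced when fitting `topic_model`.
batch_size = 1000
new_topics = []
for i in range(0, len(docs), batch_size):
    batch_docs = docs[i:i + batch_size]
    batch_topics = topics[i:i + batch_size]
    # reduce_outliers returns updated topic assignments for the batch
    new_topics.extend(topic_model.reduce_outliers(batch_docs, batch_topics))
```

Since outlier reduction is computed per document, concatenating the per-batch results should give the same assignments as a single call, at a fraction of the peak memory.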

eneimdrop commented 1 year ago

1000 was apparently still too big when I tried to compute in batches, for some reason (probably poor coding on my part), but batches of size 250 worked out fine! Thanks for the advice!