MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Your session crashed after using all available RAM #431

Closed — benstaf closed this issue 2 years ago

benstaf commented 2 years ago

I want to build a model with 500k documents.

I got this error message on Google Colab: Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro

Is there a way to train the model and get the top topic (no need for all the prob matrix) by optimizing the use of resources from Colab Free Tier?
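For a sense of scale, skipping the full probability matrix does save a meaningful amount of memory. A rough back-of-envelope estimate (the topic count of 1,000 is an assumption for illustration, not a number from this issue):

```python
# Back-of-envelope RAM estimate for the dense document-topic probability
# matrix that calculate_probabilities=True would produce.
n_docs = 500_000          # corpus size from this issue
n_topics = 1_000          # assumption: plausible topic count at this scale
bytes_per_float = 8       # float64

prob_matrix_gb = n_docs * n_topics * bytes_per_float / 1024**3
print(f"{prob_matrix_gb:.1f} GB")  # ≈ 3.7 GB for this matrix alone
```

That is on top of the embeddings and the UMAP/HDBSCAN working memory, so skipping it helps but does not by itself fit a 500k corpus into the Colab free tier.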

MaartenGr commented 2 years ago

There are a few ways to reduce the RAM that BERTopic needs, but there are limits to how far that goes. The most impactful solutions can be found in the documentation here. I would advise reading through them and trying them all out.

However, although the tips above might reduce the necessary RAM, that does not mean BERTopic will run within any RAM budget. If the data becomes large enough, you will simply need more RAM.

Hopefully, the tips above solve the issue for you!
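The documented tips boil down to a few settings. A minimal sketch, assuming a recent BERTopic version (check your version's docs for exact parameter names):

```python
# Sketch of the documented memory-saving settings for BERTopic.
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# min_df trims rare words from the c-TF-IDF vocabulary, which shrinks
# the document-term matrix considerably on large corpora.
vectorizer_model = CountVectorizer(min_df=10, stop_words="english")

topic_model = BERTopic(
    low_memory=True,                # lower-memory UMAP computation
    calculate_probabilities=False,  # skip the dense doc-topic matrix
    vectorizer_model=vectorizer_model,
)
# topics, _ = topic_model.fit_transform(docs)  # docs: your list of texts
```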

benstaf commented 2 years ago

I read them and tried them all without any success.

This algorithm is O(n²) in memory use, so without aggressive optimization or hyperparameter tuning, it is unusable for large datasets.

I found a better algorithm in this GitHub project: https://github.com/MilaNLProc/contextualized-topic-models

It used only 8 GB of RAM on the Google Colab free tier for my 500k corpus, and the output looked reasonable.

To stay relevant, the BERTopic project would greatly benefit from a benchmark comparing its performance against alternative algorithms, especially those that consume less memory.

Moreover, providing recommended hyperparameter values as a function of corpus size would make BERTopic easier to use in practice.

Good luck 👍

MaartenGr commented 2 years ago

Sorry to hear that the proposed solutions were not sufficient for your use case. Fortunately, there are plenty of topic modeling techniques that might suit it much better. With that in mind, it might also be interesting to look at OCTIS, a package co-developed by one of the authors of contextualized topic models. It implements quite a few topic modeling techniques, along with evaluation measures that might be interesting to you.

I would like to stress, though, that BERTopic is not unusable for large datasets if sufficient RAM is available. Having said that, I understand that increasing RAM may not be straightforward or possible for all users, and I will make sure to put memory optimization on the roadmap!

If you decide to use BERTopic and run into issues or have some questions, let me know and I'll be glad to help!