Memory error with ~1m documents (no GPU available, low_memory=True)

MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

https://maartengr.github.io/BERTopic/

MIT License

Memory error with ~1m documents (no GPU available, low_memory=True) #625

Closed Rorickt closed 2 years ago

Rorickt commented 2 years ago

Hi!

First, thank you for the library, I'm really enjoying working with it!

I am working with documents that are multiple sentences. I split them up and work with each sentence. Afterward I (plan to) merge them back to end up with multiple topics per document.
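
For illustration, here is a minimal sketch of that split-and-merge workflow using only the Python standard library. The regex splitter and the helper names `split_into_sentences` and `merge_topics_per_document` are assumptions made for this example and are not part of BERTopic; a proper sentence tokenizer would likely handle real text better.

```python
import re
from collections import defaultdict


def split_into_sentences(documents):
    """Return (sentences, doc_ids): one entry per sentence plus its source document index."""
    sentences, doc_ids = [], []
    for doc_id, text in enumerate(documents):
        # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                sentences.append(sentence)
                doc_ids.append(doc_id)
    return sentences, doc_ids


def merge_topics_per_document(doc_ids, topics, n_documents):
    """Collect the distinct topics of each document's sentences, in first-seen order."""
    merged = defaultdict(list)
    for doc_id, topic in zip(doc_ids, topics):
        if topic not in merged[doc_id]:
            merged[doc_id].append(topic)
    return [merged[i] for i in range(n_documents)]


if __name__ == "__main__":
    documents = ["The battery died. Support was helpful!", "Great screen."]
    sentences, doc_ids = split_into_sentences(documents)
    # In practice the topics would come from fit_transform(sentences);
    # these values are placeholders for illustration only.
    placeholder_topics = [3, 7, 1]
    print(merge_topics_per_document(doc_ids, placeholder_topics, len(documents)))
```

Keeping `doc_ids` aligned with the sentence list is what makes the merge step possible afterward, whatever model assigns the topics.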

However, after splitting my data I have around a million sentences, and this crashes the kernel when I call fit_transform(). The error says it cannot allocate enough memory during the dimensionality reduction step with UMAP. Setting low_memory=True gives the same error: it uses up all 32 GB of RAM that I have available. I also have calculate_probabilities=False.

Should I just accept the limitations of my system and work with a smaller (randomized) subset of my full data to reduce the load? Or are there some tricks I can still apply?

MaartenGr commented 2 years ago

Thank you for your kind words!

Scalability can definitely be an issue when handling a million documents. Specifically for that reason, I created an FAQ page that has a bunch of tricks that can help you out with that! Hopefully, these should suffice in making it possible to train your model.

There are a few other tricks that you can do that might be a bit more advanced:

Fit on a smaller portion of the data and transform the rest
Use another dimensionality reduction algorithm like PCA or another clustering algorithm like k-Means
Use GPU-accelerated UMAP and HDBSCAN (see this page)
Speed up UMAP with PCA-initialization (see this page)
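
A rough sketch of the first two tricks, assuming a BERTopic version that accepts scikit-learn models through `umap_model` and `hdbscan_model`. The helper names `build_light_model` and `fit_on_subset`, the batch size, and `n_clusters=50` are illustrative assumptions rather than recommendations, and the right subset size depends on your data and RAM.

```python
import random


def build_light_model(n_components=5, n_clusters=50):
    """BERTopic with PCA and k-Means swapped in for UMAP and HDBSCAN."""
    # Imported lazily so fit_on_subset below can be used without these packages.
    from bertopic import BERTopic
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    return BERTopic(
        umap_model=PCA(n_components=n_components),
        hdbscan_model=KMeans(n_clusters=n_clusters),
        calculate_probabilities=False,
    )


def fit_on_subset(topic_model, docs, fit_size, batch_size=50_000, seed=42):
    """Fit on a random subset of docs, then transform the rest in batches."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)  # reproducible random split
    fit_docs = [docs[i] for i in indices[:fit_size]]
    rest_docs = [docs[i] for i in indices[fit_size:]]

    fit_topics, _ = topic_model.fit_transform(fit_docs)

    # Transform in batches so only one batch of embeddings is in memory at a time.
    rest_topics = []
    for start in range(0, len(rest_docs), batch_size):
        batch_topics, _ = topic_model.transform(rest_docs[start:start + batch_size])
        rest_topics.extend(batch_topics)

    return fit_docs, list(fit_topics), rest_docs, rest_topics
```

It could be called as `fit_on_subset(build_light_model(), sentences, fit_size=100_000)`, where `sentences` is your own list of strings. Any model exposing `fit_transform` and `transform` works with the helper, so the default UMAP and HDBSCAN setup can be passed in instead of the lighter one.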

Rorickt commented 2 years ago

Thank you for your quick response!

I went through that page and it was indeed helpful! I have adjusted my parameters to follow those tips, but to no avail. I do not have access to a GPU, so unfortunately that is out. I was hoping not to have to fit on just a part of the data and transform the rest, as it would be a bit of a shame ;)

I missed the tip about speeding up UMAP with PCA initialization, so I'll try that too!
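
For reference, a sketch of what PCA initialization for UMAP might look like, assuming precomputed sentence embeddings in a NumPy array. The helper names and the rescaling constant are assumptions for this example. The tip is aimed at speed, so it may not lower peak memory on its own.

```python
import numpy as np


def rescale(embedding):
    """Return a copy scaled so the first component has a standard deviation of 1e-4."""
    scaled = np.array(embedding, dtype=float, copy=True)
    scaled /= np.std(scaled[:, 0]) * 10000
    return scaled


def build_pca_initialised_umap(embeddings, n_components=5):
    """Build a UMAP model whose starting positions come from a PCA projection."""
    # Lazy imports: scikit-learn and umap-learn are only needed when this is called.
    from sklearn.decomposition import PCA
    from umap import UMAP

    pca_embeddings = rescale(PCA(n_components=n_components).fit_transform(embeddings))
    return UMAP(
        n_neighbors=15,
        n_components=n_components,
        min_dist=0.0,
        metric="cosine",
        init=pca_embeddings,  # one starting position per row of `embeddings`
        low_memory=True,
    )
```

Because `init` holds one row per input sample, the resulting model has to be fitted on the same embeddings it was built from, for example by passing it as `umap_model` and supplying those embeddings to `fit_transform`.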

Also, I just now saw there is another comment posted under issues dealing with exactly this! I'm sorry for repeating the question! There are good discussions in those threads and I should learn from them too!

Thanks again!

MaartenGr commented 2 years ago

No problem! Please feel free to post any questions or concerns you have, even if they might already be mentioned somewhere else. Your use case might be different, and it would be a shame if a simple fix were overlooked because of that 😄