Thank you for your kind words!
Scalability can definitely be an issue when handling a million documents. For exactly that reason, I created an FAQ page with a number of tricks that can help you out! Hopefully, those are enough to make training your model possible.
There are a few other tricks that are a bit more advanced (see the sketch after this list):

* Train on a random subset of your documents with `.fit()` and then use `.transform()` to predict topics for the rest.
* If you have access to a GPU, use a GPU-accelerated version of UMAP to speed up the dimensionality reduction.
* Initialize UMAP with PCA embeddings to accelerate and stabilize the reduction step.
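For the first trick, a minimal sketch of what that could look like (the subset size and variable names are just illustrative):

```python
import random

from bertopic import BERTopic

# Train on a random subset to keep UMAP's memory usage manageable;
# low_memory=True and calculate_probabilities=False further reduce RAM.
subset = random.sample(docs, k=100_000)
topic_model = BERTopic(low_memory=True, calculate_probabilities=False)
topic_model.fit(subset)

# Predict topics for the full corpus with the already-trained model.
topics, probs = topic_model.transform(docs)
```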
Thank you for your quick response!
I went through that page and it was indeed helpful! I have adjusted my parameters to follow those tips, but to no avail. I do not have access to a GPU, so unfortunately that option is out. I was also hoping not to have to fit on just part of the data and transform the rest, as that would be a bit of a shame ;)
I missed the tip of using PCA-acceleration so I'll try that too!
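If I understand that tip correctly, it would look roughly like this (just my sketch based on the FAQ; the rescaling of the PCA output is my own guess, and `embeddings` stands for my precomputed sentence embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA
from umap import UMAP
from bertopic import BERTopic

# Reduce the sentence embeddings with PCA first and use the result to
# initialize UMAP, which speeds up and stabilizes its optimization.
pca_embeddings = PCA(n_components=5).fit_transform(embeddings)
pca_embeddings /= np.max(np.abs(pca_embeddings))  # my guess: keep init values small

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    init=pca_embeddings,  # PCA-based initialization instead of spectral
    low_memory=True,
)
topic_model = BERTopic(umap_model=umap_model, calculate_probabilities=False)

# Must fit on the same embeddings the PCA initialization was computed from.
topics, _ = topic_model.fit_transform(sentences, embeddings)
```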
Also, I just now saw that there is another comment posted under Issues dealing with exactly this! I'm sorry for repeating the question! There is good discussion in those threads, and I should learn from them too!
Thanks again!
No problem! Please feel free to post any questions or concerns you have, even if they might already be mentioned somewhere else. It might turn out that your use case is different, and it would be a shame if a simple fix were overlooked because of that 😄
Hi!
First, thank you for the library; I'm really enjoying working with it!
I am working with documents that consist of multiple sentences. I split them up and work with each sentence separately. Afterward, I plan to merge them back to end up with multiple topics per document.
However, after splitting my data I have around a million sentences, and this seems to crash the kernel when using `fit_transform()`. I get an error about not being able to allocate enough memory during topic reduction through UMAP. When I set `low_memory=True`, I get the same error; it uses up all 32 GB of RAM that I have available. I already have `calculate_probabilities=False`.
Should I just accept the limitations of my system and work with a smaller (randomized) subset of my full data to reduce the load? Or are there some tricks I can still apply?
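For context, my pipeline looks roughly like this (a simplified sketch; `documents` is my raw data and the merge step at the end is only planned, not implemented yet):

```python
from collections import Counter

from bertopic import BERTopic
from nltk.tokenize import sent_tokenize  # requires nltk's "punkt" data

# Split each document into sentences, remembering which document
# each sentence came from.
sentences, doc_ids = [], []
for doc_id, document in enumerate(documents):
    for sentence in sent_tokenize(document):
        sentences.append(sentence)
        doc_ids.append(doc_id)

# Roughly a million sentences end up here, which is where the kernel dies.
topic_model = BERTopic(low_memory=True, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(sentences)

# Planned merge step: collect the topics of each document's sentences
# so a document can end up with multiple topics.
topics_per_doc = {}
for doc_id, topic in zip(doc_ids, topics):
    topics_per_doc.setdefault(doc_id, Counter())[topic] += 1
```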