bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Ability to stream corpus data to LDAModel (or any other model) #162

Open jalustig opened 2 years ago

jalustig commented 2 years ago

Tomotopy currently loads all documents into memory before training, and only then trains on them.

However, I have a very large corpus (about 750,000 documents), and even if I only want to train on a portion of it, I am heavily RAM-limited. Loading just 20,000 documents causes my script to take up 20 GB of RAM.
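For reference, this is roughly the in-memory pattern I'm using now (a minimal sketch; the file name and whitespace tokenization are placeholders for my actual preprocessing):

```python
import tomotopy as tp

# All documents are added (and held in memory) before training starts.
mdl = tp.LDAModel(k=20)
with open("corpus.txt") as f:
    for line in f:
        mdl.add_doc(line.split())  # every document accumulates inside the model

mdl.train(1000)  # training only begins once the full corpus is loaded
```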

Gensim can stream an iterable document corpus, which makes it much more scalable in terms of RAM. Would it be possible to adjust Tomotopy to support a similar capability, so that one could train on a larger dataset? A sketch of the Gensim-style pattern I have in mind is below.
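Something along these lines is what Gensim allows (again just a sketch, with a placeholder file name and whitespace tokenization): the corpus is any iterable that yields one document at a time, so nothing has to sit in memory at once.

```python
from gensim import corpora, models

class StreamedCorpus:
    """Yields one bag-of-words document at a time instead of loading the whole corpus."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.split())

# Build the vocabulary in one streaming pass, then train LDA from the iterable.
dictionary = corpora.Dictionary(line.split() for line in open("corpus.txt"))
lda = models.LdaModel(StreamedCorpus("corpus.txt", dictionary),
                      num_topics=20, id2word=dictionary)
```

Being able to hand Tomotopy a similar iterable, rather than adding every document up front, would make it feasible to train on the full 750,000-document corpus.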