Tomotopy currently loads all documents into memory before training begins, and only then trains on them.
However, I have a very large corpus (about 750,000 documents), and even training on just a portion of it leaves me heavily RAM-limited: loading only 20,000 documents makes my script consume around 20 GB of RAM.
Gensim can stream an iterable document corpus, which makes it much more scalable in terms of RAM. Would it be possible to adjust Tomotopy to support a similar capability, so that one could train on a larger dataset?
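For reference, the Gensim-style pattern I have in mind is roughly the following: any object whose `__iter__` lazily re-reads the source can serve as a corpus, so only one document needs to be in memory at a time. This is just a minimal sketch; the file path and whitespace tokenization are placeholders.

```python
class StreamedCorpus:
    """Yield one tokenized document per line of a text file, lazily.

    Because __iter__ reopens the file each time, the corpus can be
    iterated over multiple times (e.g. once per training pass) without
    ever holding the full document set in RAM.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                # Naive whitespace tokenization as a stand-in for a
                # real preprocessing pipeline.
                yield line.split()
```

A trainer that accepts any iterable of token lists could then consume such an object directly, instead of requiring every document to be added up front.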