I looked in the documentation and could not find any tooling for building a lexicon when the Corpus doesn't fit in memory.
Say I want to build tf-idf vectors over a given lexicon of 10 million ngrams, but I can't hold in memory all the text files needed to discover that the corpus contains those 10 million ngrams in the first place.
What I would like to do is build the lexicon incrementally from batches of documents as I load them (note that I don't want to keep the documents' text around; I just want to tokenize them to learn the lexicon from the data). Something like:
```julia
for batch_of_documents in folder
    update!(lexicon, batch_of_documents, tokenizer)
end
```
and then:

```julia
m = DocumentTermMatrix(["some text here", "here more text"]; lexicon, tokenizer)
```
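In case it helps, here is roughly what I do by hand today in plain Julia. Everything in this sketch is hypothetical, not an existing TextAnalysis.jl API: `build_lexicon` is a helper I made up, whitespace `split` stands in for a real ngram tokenizer, and the lexicon is just a token => count `Dict`:

```julia
# Hypothetical streaming lexicon build: only token counts stay in memory,
# the document text itself is discarded as soon as it is tokenized.
function build_lexicon(folder::AbstractString; tokenizer = split)
    lexicon = Dict{String, Int}()
    for file in readdir(folder; join = true)
        # Stream the file line by line instead of reading it whole.
        for line in eachline(file)
            for token in tokenizer(line)
                lexicon[token] = get(lexicon, token, 0) + 1
            end
        end
    end
    return lexicon
end

lexicon = build_lexicon("path/to/text/files")  # hypothetical corpus folder
```

This keeps memory bounded by the lexicon size rather than the corpus size, but I'd much rather use something supported by the package if it exists.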
Is there a way to do this?