darentsia closed this issue 6 years ago
We only store the dictionary and word vectors in RAM; there is no vector or anything else per document. So the final model size is determined only by your vocabulary size and embedding dimension, both of which are yours to choose.
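To make that concrete, here is a rough back-of-the-envelope sketch (the `vocab_size` and `dim` values below are illustrative, not from this thread; float32 means 4 bytes per value):

```python
# Rough size of the stored word vectors alone: vocab_size * dim float32 values.
vocab_size = 200_000   # illustrative vocabulary size
dim = 300              # illustrative embedding dimension
size_mb = vocab_size * dim * 4 / 2**20  # 4 bytes per float32
print(f"~{size_mb:.0f} MB of word vectors")  # ~229 MB
```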
Thank you!
Embedding matrices are float32. You have two matrices of size `dim * vocab_size` and one of size `dim * bucket`; those matrices account for most of the RAM used. If your experiment uses too much RAM, reduce the vocabulary size by increasing `min_count`. There is no dependency between the size of the corpus and the size of the matrices, besides the tendency of `vocab_size` to increase with the size of your corpus (for a fixed `min_count`). A good vocab size is around 200k, but it depends on the task.
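A minimal sketch of that arithmetic (the defaults `dim=100` and `bucket=2_000_000` below are fastText's documented defaults; the function name is made up for illustration):

```python
def estimate_ram_bytes(vocab_size, dim=100, bucket=2_000_000):
    """Approximate RAM held by the embedding matrices (float32 = 4 bytes):
    two matrices of size dim * vocab_size plus one of size dim * bucket."""
    return 4 * (2 * dim * vocab_size + dim * bucket)

# Example: a 200k vocabulary with default dim and bucket.
print(f"~{estimate_ram_bytes(200_000) / 2**30:.1f} GiB")  # ~0.9 GiB
```

Note that with the default `bucket` size, the `dim * bucket` matrix dominates, which is why the corpus size itself barely matters.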
Thank you, @mpagli, for your detailed response! It helped!
Hello! How can I calculate how much RAM and how much free hard-drive space I need during training?
For example: I have a 100 GB .txt file (docs + tweets), nearly 355,000,000 docs. Does your model store in RAM one vector for each doc, plus a dictionary of all words from these docs?
Can you please explain what your model stores in RAM during training?
And how does the final model size depend on the size of the input file or the number of docs in it?
And how can I get an approximate estimate of RAM and disk usage?
Could you please provide a clear explanation, in terms like the answer to this post about doc2vec: https://stackoverflow.com/questions/45943832/gensim-doc2vec-finalize-vocab-memory-error