epfml / sent2vec

General purpose unsupervised sentence representations

How to calculate the amount of RAM and hard-drive space needed to successfully train and save a model? #40

Closed darentsia closed 6 years ago

darentsia commented 6 years ago

Hello! How can I calculate how much RAM and how much free hard-drive space I need during training?

For example, I have a 100 GB .txt file (docs + tweets) with nearly 355,000,000 docs. Does your model store in RAM one vector for each doc plus a dictionary of all words from these docs?

  1. Can you please explain what your model stores in RAM during training?

  2. How does the final model size depend on the size of the input file or the number of docs in it?

  3. How can I get an approximate estimate of RAM and hard-drive usage?

  4. Can you please provide a clear explanation in terms like the answer in this post about doc2vec: https://stackoverflow.com/questions/45943832/gensim-doc2vec-finalize-vocab-memory-error

martinjaggi commented 6 years ago

We only store the dictionary and word vectors in RAM; there is no vector or anything else stored per document. So the final model size is determined only by your vocabulary size and the embedding dimension, both of which are free for you to choose.
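
As a quick worked example (the numbers here are assumptions for illustration, not defaults): with a 200,000-word vocabulary and 700-dimensional float32 vectors, the saved word-vector matrix alone takes roughly 200,000 * 700 * 4 bytes ≈ 0.52 GiB on disk.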

darentsia commented 6 years ago

Thank you!

mpagli commented 6 years ago

Embedding matrices are float32: you have two matrices of size dim * vocab_size and one of size dim * bucket. These matrices account for most of the RAM used. If your experiment uses too much RAM, reduce the vocabulary size by increasing min_count. There is no dependency between the size of the corpus and the size of the matrices, besides the tendency of vocab_size to grow with the corpus (for a fixed min_count). A good vocab size is around 200k, though it depends on the task.
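
A rough back-of-the-envelope sketch of that estimate, following the matrix shapes described above; the helper name and the example values for dim and bucket are illustrative assumptions, not sent2vec defaults:

```python
def sent2vec_ram_estimate_gib(vocab_size, dim, bucket, bytes_per_float=4):
    """Estimate RAM (GiB) for the embedding matrices described above:
    two matrices of dim * vocab_size entries (word vectors) plus one of
    dim * bucket entries (n-gram buckets), all float32. The dictionary
    itself and training buffers are ignored, so treat this as a floor."""
    word_matrices = 2 * vocab_size * dim * bytes_per_float
    ngram_matrix = bucket * dim * bytes_per_float
    return (word_matrices + ngram_matrix) / 2**30

# Illustrative numbers only (not library defaults): 200k vocabulary,
# 700-dim vectors, 2M n-gram buckets.
print(f"{sent2vec_ram_estimate_gib(200_000, 700, 2_000_000):.2f} GiB")
# -> 6.26 GiB
```

By this estimate, the n-gram bucket matrix dominates, which is why a huge corpus does not blow up RAM by itself: only vocab_size (via min_count), dim, and bucket matter.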

darentsia commented 6 years ago

Thank you, @mpagli, for your detailed response! It helped!