epfml / sent2vec

General purpose unsupervised sentence representations

Memory Size #35

Closed sxhmilyoyo closed 6 years ago

sxhmilyoyo commented 6 years ago

When I use twitter_bigrams.bin, I get the following error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

It seems to be running out of memory. Does that mean I need 23 GB of RAM to use twitter_bigrams.bin?

Thanks.

darentsia commented 6 years ago

I have the same issue while trying to train my own model on an 80 GB corpus.

Read 10670M words
Number of words:  3788685
Number of labels: 0
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

Have you found a way to fix it?

Thanks.

mpagli commented 6 years ago

At both training and inference time, the embedding matrices need to fit in RAM. If you don't have that much RAM, you can play with these training hyperparameters: the vocabulary size (controlled by minCount), the number of hash buckets for n-grams (bucket), and the embedding dimension (dim).

I would start by reducing the vocabulary size; if that is still too much, I would reduce the bucket size, and only then decrease the embedding size. Reducing the vocabulary size is unlikely to affect the quality of the final sentence embeddings much (or at all, depending on the application), but decreasing the dim parameter will.
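To see how each of those knobs scales the footprint: the memory is roughly dominated by an input matrix of (vocabulary + bucket) x dim and an output matrix of vocabulary x dim, stored as 32-bit floats. A back-of-the-envelope sketch with illustrative numbers (not the exact configuration of the published models):

def estimate_ram_gib(vocab_size, bucket, dim, bytes_per_float=4):
    """Rough RAM estimate (GiB) for the two embedding matrices.

    Assumes a fastText-style layout: an input matrix with (vocab + bucket)
    rows and an output matrix with vocab rows, both of width dim, stored as
    32-bit floats. Actual usage is somewhat higher (dictionary, buffers, ...).
    """
    input_matrix = (vocab_size + bucket) * dim * bytes_per_float
    output_matrix = vocab_size * dim * bytes_per_float
    return (input_matrix + output_matrix) / (1024 ** 3)

# Illustrative numbers only:
print(estimate_ram_gib(vocab_size=750_000, bucket=2_000_000, dim=700))  # ~9.1 GiB
print(estimate_ram_gib(vocab_size=200_000, bucket=500_000, dim=700))    # ~2.3 GiB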

That was for training. Sadly, at inference time you cannot reduce the size of the matrices: you have to allocate space according to the specifications the model was trained with. It is true that the published models could be smaller without losing much quality, but in a research setting large vocabulary sizes were the safe option.
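For reference, the allocation happens as soon as the model is loaded. With the Python wrapper in this repo it looks roughly like this (a sketch; method names may differ slightly between versions):

import sent2vec

# Loading the published model pulls the full matrices into RAM, so the
# process needs roughly the on-disk model size in free memory.
model = sent2vec.Sent2vecModel()
model.load_model('twitter_bigrams.bin')

emb = model.embed_sentence('sample tweet text')
print(emb.shape)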

darentsia commented 6 years ago

Hello @mpagli! I had reduced the bucket size before your comment and it is still training right now. If it fails again, I'll try reducing minCount and dim as well. Thank you for your comment! It helped a lot.

mpagli commented 6 years ago

Great to hear you found a workaround!

Just as a tip though: I would expect better embeddings if you keep the default bucket size and instead limit the vocabulary size to roughly 100k to 500k words using minCount.
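If you want to pick a minCount that lands in that range, one option is to scan the corpus word frequencies first. A minimal sketch (the corpus path and the 300k target below are placeholders):

from collections import Counter

# Count word frequencies over the training corpus (one tokenized sentence
# per line). 'corpus.txt' and the 300k target are placeholders.
counts = Counter()
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        counts.update(line.split())

target_vocab = 300_000
freqs = sorted(counts.values(), reverse=True)

if len(freqs) > target_vocab:
    # Smallest minCount that keeps the vocabulary at or below the target size.
    min_count = freqs[target_vocab] + 1
else:
    min_count = 1  # the corpus vocabulary is already small enough

kept = sum(1 for c in counts.values() if c >= min_count)
print(f"minCount={min_count} keeps about {kept} words")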