facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

1 iteration for large data vs 5 iterations for part of data #611

Closed: angrypark closed this issue 6 years ago

angrypark commented 6 years ago

Hi. I have a large corpus (1B sentences, Korean) and I am trying to train embeddings with fastText. The paper (Learning Word Vectors for 157 Languages) suggests that more iterations improve performance, but training on 1B sentences 5 (or more) times is not feasible for me. I need to choose between training on the 1B sentences once and training on 100M sentences 5 times. Do you have any suggestions?

EdouardGrave commented 6 years ago

Hi @angrypark,

In general, the more data, the better. Thus, I would suggest training for 1 epoch over the full dataset of 1B sentences. May I ask why you cannot run more than 1 epoch?
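For reference, a minimal command-line sketch of this setup (the input and output file names are placeholders; `-epoch` is the relevant option, and its default is 5):

```sh
# Train skipgram vectors over the full corpus for a single epoch.
# korean_corpus.txt is a hypothetical pre-tokenized input file.
./fasttext skipgram -input korean_corpus.txt -output korean_model -epoch 1
```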

Best, Edouard.

angrypark commented 6 years ago

Hi @EdouardGrave, thanks for the answer. The reason I can't run more than 1 epoch over the 1B sentences is my limited computing resources. It took 13 hours to train on 100M sentences for 5 epochs (I think the tokenizer I use also slowed things down). But I will try to train on the 1B sentences for 1 epoch, or 2 if possible. Thanks a lot again :)
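Since fastText reads whitespace-separated tokens, one way to keep the tokenizer from dominating the runtime is to tokenize the corpus once up front and train on the saved output. A rough sketch, where `your_tokenizer` is a placeholder for whatever Korean tokenizer is in use:

```sh
# Pay the tokenization cost once, writing whitespace-separated tokens to disk.
# your_tokenizer is a hypothetical command; substitute your actual tokenizer.
your_tokenizer < raw_corpus.txt > korean_corpus.txt

# Subsequent training runs then read the pre-tokenized file directly.
./fasttext skipgram -input korean_corpus.txt -output korean_model -epoch 1
```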

Regards, Sungnam.

EdouardGrave commented 6 years ago

Hi Sungnam,

One thing that might help you train faster is changing the number of threads used by fastText, via the -thread command line option. The default is 12; I would suggest trying different values, such as 2, 4 or 8, which might lead to better results depending on your CPU.
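As an illustration (file names are again placeholders):

```sh
# Same training command as above, with an explicit thread count.
# Try a few values and keep whichever gives the best throughput on your machine.
./fasttext skipgram -input korean_corpus.txt -output korean_model -epoch 1 -thread 8
```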

Best, Edouard.