Hi @angrypark,
In general, the more data, the better. Thus, I would suggest training for 1 epoch over the full dataset of 1B sentences. May I ask why you cannot do more than 1 epoch?
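For reference, a single pass over the full corpus with the fastText command line tool might look like the following sketch (file names are placeholders):

```
# One epoch of skipgram training over the full 1B-sentence corpus.
# corpus_1b.txt and model_1b are placeholder names.
./fasttext skipgram -input corpus_1b.txt -output model_1b -epoch 1
```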
Best, Edouard.
Hi @EdouardGrave, Thanks for the answer. The reason I can't use more than 1 epoch for 1B sentences is my limited compute: it took 13 hours to train on 100M sentences for 5 epochs (and I think the tokenizer I use also slows things down). But I will try to train on 1B sentences for 1 epoch, or 2 if possible. Thanks a lot again :)
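If the tokenizer turns out to be the main bottleneck, one option is to tokenize the corpus once up front and train on the pre-tokenized file, so the tokenization cost is paid only a single time. This is just a sketch; my_korean_tokenizer is a placeholder for whatever tool emits whitespace-separated tokens:

```
# Hypothetical pipeline: tokenize once, reuse the output for every run.
my_korean_tokenizer < corpus_raw_ko.txt > corpus_tokenized_ko.txt
./fasttext skipgram -input corpus_tokenized_ko.txt -output model_ko
```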
Regards, Sungnam.
Hi Sungnam,
One thing that might help you train faster is to change the number of threads used by fastText: this is done with the -thread command line option. The default is 12, and I would suggest trying different values, such as 2, 4 or 8, which might lead to better results depending on your CPU.
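As an illustration, the same kind of run restricted to 4 threads might be invoked like this (file names are again placeholders):

```
# Same skipgram training, but with 4 threads instead of the default 12.
./fasttext skipgram -input corpus_1b.txt -output model_1b -epoch 1 -thread 4
```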
Best, Edouard.
Hi. I have a large corpus (1B sentences, Korean) and I am trying to train embeddings with fastText. The paper (Learning Word Vectors for 157 Languages) suggests that more iterations improve performance, but it is not feasible for me to train on 1B sentences for 5 (or more) epochs. I want to choose between training on 1B sentences for 1 epoch and training on 100M sentences for 5 epochs. Do you have any suggestions?
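For context, a 100M-sentence subset for such a comparison could be drawn with standard tools, for example (file names are placeholders):

```
# Take the first 100M lines as a pilot subset. head is memory-light;
# a random sample (e.g. shuf -n) would be more representative but has
# to hold the sampled lines in memory.
head -n 100000000 corpus_1b.txt > corpus_100m.txt
./fasttext skipgram -input corpus_100m.txt -output model_100m -epoch 5
```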