epfml / sent2vec

General purpose unsupervised sentence representations

Training for other languages #55

Closed shanalikhan closed 5 years ago

shanalikhan commented 5 years ago

I'd like to train a model for the Urdu language.

https://github.com/epfml/sent2vec#training-new-models

./fasttext sent2vec -input wiki_sentences.txt -output my_model -minCount 8 -dim 700 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000

Is this command enough, or do we need to do some preprocessing first, such as stemming each word in the file?

martinjaggi commented 5 years ago

We recommend training without any stemming or similar preprocessing, so the command you posted should be sufficient. Depending on the size and properties of your corpus, you might have to adjust -minCount and other parameters.
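
One rough way to pick a -minCount value for a new corpus is to count how many distinct words would survive each candidate threshold. The sketch below is not part of sent2vec. The helper names (count_words, vocab_size_at, token_coverage_at) are hypothetical, and it assumes the input file has one sentence per line with words separated by whitespace. The tool builds its vocabulary with its own tokenizer, so the numbers are only an approximation.

```python
import sys
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens over all lines (assumed tokenization)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def vocab_size_at(counts: Counter, min_count: int) -> int:
    """Number of distinct words that occur at least min_count times."""
    return sum(1 for c in counts.values() if c >= min_count)


def token_coverage_at(counts: Counter, min_count: int) -> float:
    """Fraction of all token occurrences that belong to the kept words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(c for c in counts.values() if c >= min_count)
    return kept / total


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python mincount_report.py wiki_sentences.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        word_counts = count_words(f)
    for threshold in (1, 2, 5, 8, 20):
        size = vocab_size_at(word_counts, threshold)
        coverage = token_coverage_at(word_counts, threshold)
        print(f"minCount={threshold}: vocab={size}, token coverage={coverage:.3f}")
```

If a small corpus loses most of its vocabulary at -minCount 8, a lower threshold may be worth trying. Treat the output as a starting point for experiments rather than a recommendation.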