Closed by darentsia 6 years ago
Yes, the training file needs to have one sentence per line; see https://github.com/epfml/sent2vec#training-new-models
The corpus size you mention should be fine, we'd assume (maybe try tuning the hyper-parameters on a smaller subset of it first, starting from our default ones).
Hope this helps a bit.
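To make the "one sentence per line" format concrete, here is a minimal Python sketch that writes a list of sentences to a plain-text training file. It assumes the documents have already been split into sentences; the function name `write_training_file` and the file name `corpus_example.txt` are made up for illustration and are not part of sent2vec.

```python
def write_training_file(sentences, path):
    """Write one sentence per line as plain text (no lists, no JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            # Collapse newlines and repeated whitespace so that each
            # sentence occupies exactly one line of the output file.
            line = " ".join(sentence.split())
            if line:  # skip sentences that are empty after cleaning
                f.write(line + "\n")


# Hypothetical example data, only to show the resulting file layout.
sentences = [
    "this is the first sentence .",
    "a tweet that\nspans two lines",
    "   ",
]
write_training_file(sentences, "corpus_example.txt")
```

The resulting file can then be passed to the `-input` option. Each line is a raw string, not a Python list or any other serialized structure.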
@martinjaggi Thank you! So it is one string per line, right? Not a list per line or some other type?
Yes.
Thank you!
Hello! I found your answer about what each sentence should look like:
But if I want to build my own model, what should I feed to sent2vec?
../sent2vec/fasttext sent2vec -input data/documents_mixed.txt -output my_model -minCount 8 -dim 700 -epoch 10 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 36 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000
How do the sentences need to be separated from each other?
For example:
Thank you in advance!
And one more question: for a very large corpus (84 GB of tweets and web docs), what are your recommendations for the parameters t, dropoutK and bucket? And is it possible to decay the learning rate linearly down to a particular minimum learning rate value?
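To clarify what is meant by a linear decay down to a minimum value, here is a small Python sketch of such a schedule. It only illustrates the question and does not describe how sent2vec updates its learning rate. The function name `linear_lr` and the `lr_min` value are assumptions; `lr_start=0.2` just mirrors the `-lr 0.2` in the command above.

```python
def linear_lr(progress, lr_start=0.2, lr_min=0.01):
    """Interpolate linearly from lr_start (progress=0) to lr_min (progress=1).

    `progress` is the fraction of training completed, e.g. tokens
    processed so far divided by the total number of tokens over all epochs.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return lr_start - (lr_start - lr_min) * progress


# Print the schedule at a few points during training.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"progress={p:.2f}  lr={linear_lr(p):.4f}")
```

With `lr_min=0.0` this reduces to a plain linear decay to zero.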