epfml / sent2vec

General purpose unsupervised sentence representations

What should 'input.txt' look like? #38

Closed · darentsia closed this issue 6 years ago

darentsia commented 6 years ago

Hello! I found your answer about what each sentence should look like:

it is important for machine_learning

But if I want to build my own model, what should I feed to sent2vec?

../sent2vec/fasttext sent2vec -input data/documents_mixed.txt -output my_model -minCount 8 -dim 700 -epoch 10 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 36 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000

How do the sentences need to be separated from one another?

For example:

Sentence 1: "i love my dog", sentence 2: "my dog loves me"

input.txt: "i love my dog", "my dog loves me"
or: "i love my dog" "my dog loves me"
or: "i love my dog" (new line) "my dog loves me"

Thank you in advance!

And one more question: for a very large corpus (84 GB of tweets and web docs), what are your recommendations for the parameters t, dropoutK and bucket? And is it possible to decrease the learning rate linearly down to a particular minimum learning rate value?

martinjaggi commented 6 years ago

yes, the training file needs to have one sentence per line, see https://github.com/epfml/sent2vec#training-new-models
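
For illustration, here is a minimal Python sketch of writing such a file from the two example sentences above. The helper name `to_training_lines` and the output name `input.txt` are assumptions made for this example, not part of sent2vec, and the sketch only handles line layout (it does no tokenization or lowercasing):

```python
def to_training_lines(sentences):
    """Yield each sentence as a single line of text (hypothetical helper)."""
    for sentence in sentences:
        # Collapse newlines and repeated whitespace so a sentence never
        # spills over onto a second line.
        line = " ".join(sentence.split())
        if line:  # skip sentences that are empty after cleanup
            yield line


if __name__ == "__main__":
    sentences = ["i love my dog", "my dog loves me"]
    # Plain text, one sentence per line: no quotes, commas, or list syntax.
    with open("input.txt", "w", encoding="utf-8") as f:
        for line in to_training_lines(sentences):
            f.write(line + "\n")
```

The resulting file contains the bare sentences, each on its own line, which corresponds to the third option in the question (without the quotation marks).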

the corpus size you mention should be fine, we'd assume (maybe try tuning the hyper-parameters on a smaller subset of it, starting from our default ones)
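
One possible way to draw such a subset, sketched in Python under stated assumptions: the function `sample_lines`, the subset size, and the output path `data/documents_subset.txt` are hypothetical, and reservoir sampling is just one reasonable choice for a file too large to load into memory, not something sent2vec prescribes.

```python
import random


def sample_lines(lines, k, seed=0):
    """Return k lines chosen uniformly at random from an iterable of lines.

    Uses reservoir sampling (Algorithm R), so the input is read once and
    only k lines are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


if __name__ == "__main__":
    # Input path taken from the training command above; the output path
    # and the subset size are assumptions for this example.
    with open("data/documents_mixed.txt", encoding="utf-8") as src:
        subset = sample_lines(src, k=1_000_000)
    with open("data/documents_subset.txt", "w", encoding="utf-8") as dst:
        dst.writelines(subset)
```

The subset file keeps the one-sentence-per-line layout, so it can be passed to `-input` unchanged while trying out different hyper-parameter values.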

hope this helps a bit

darentsia commented 6 years ago

@martinjaggi Thank you! One string per line, right? Not a list per line or any other type?

martinjaggi commented 6 years ago

yes

darentsia commented 6 years ago

Thank you!