epfml / sent2vec

General purpose unsupervised sentence representations

Best hyperparameters for text similarity search? #96

Closed: 0x01h closed this issue 4 years ago

0x01h commented 4 years ago

Hi,

I'm trying to embed sentences that are 5-20 tokens long and differ only slightly in syntax. I have also collected a corpus to train on for that specific domain.

I think that using -loss ns and choosing an appropriate sampling threshold -t to select "hard negative mining" samples are important.

What are the best hyperparameters for this kind of case, where the focus is on text similarity?

mpagli commented 4 years ago

I'd use one of our pre-trained models for this. Try the wiki or twitter model; either should already give you interesting results.
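
For readers following this suggestion, here is a minimal sketch of how a similarity search over pre-trained embeddings might look. It assumes the Python bindings from this repository are installed and that a pre-trained model file has been downloaded. The helper names (cosine, rank_by_similarity, embed_with_sent2vec) and the model path are hypothetical and exist only for illustration. Cosine similarity is one common choice of score here, not something the thread prescribes.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    query_vec: Sequence[float],
    candidate_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def embed_with_sent2vec(model_path: str, sentences: List[str]):
    """Embed sentences with a pre-trained sent2vec model.

    Hypothetical helper: model_path is whatever pre-trained .bin file
    you downloaded (for example a wiki or twitter model).
    """
    import sent2vec  # imported lazily so the ranking helpers work without it

    model = sent2vec.Sent2vecModel()
    model.load_model(model_path)
    return model.embed_sentences(sentences)  # one row per sentence
```

A possible usage, with a placeholder path: call embed_with_sent2vec("path/to/pretrained_model.bin", sentences), treat the first row as the query, and pass the remaining rows to rank_by_similarity. The pre-trained models expect input preprocessed the same way as their training data (for example lowercased and tokenized), so results may degrade if that step is skipped.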