Closed: patrickfrank1 closed this issue 5 years ago
We used a dump of Wikipedia from 2015, if I recall correctly. It was tokenized and split into sentences using Stanford NLP. Sentences that were too short or too long were discarded from the corpus. What kind of preprocessing did you do on your corpus?
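The preprocessing described above can be sketched roughly as follows. This is only an illustration: it uses a naive regex sentence splitter as a stand-in for the Stanford NLP pipeline, and the `MIN_TOKENS`/`MAX_TOKENS` thresholds are assumed values, not the ones actually used.

```python
import re

# Assumed length bounds; the actual cutoffs were not stated in the thread.
MIN_TOKENS = 3
MAX_TOKENS = 100

def split_sentences(text):
    # Naive splitter standing in for Stanford NLP's sentence splitter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def preprocess(text):
    """Tokenize, lowercase, and drop sentences outside the length range."""
    kept = []
    for sent in split_sentences(text):
        tokens = sent.lower().split()  # whitespace tokenization as a stand-in
        if MIN_TOKENS <= len(tokens) <= MAX_TOKENS:
            kept.append(" ".join(tokens))
    return kept
```

For a real reproduction, the regex splitter and whitespace tokenizer would be replaced with the Stanford NLP tokenizer, but the filtering logic stays the same.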
Thanks for the clarification.
I used the enwik9 corpus for training, which should explain the performance gap, since that corpus is much smaller than your Wikipedia dump. I preprocessed it using Matt Mahoney's wikifil Perl script and also Stanford NLP.
Hi,
I was wondering which Wikipedia corpus you used to train your sent2vec models. This is important to me, since I have not managed to reproduce your results yet, even when using your hyperparameters on the enwik9 corpus.