epfml / sent2vec

General purpose unsupervised sentence representations

Which Wikipedia Corpus, for pretrained models? #81

Closed patrickfrank1 closed 5 years ago

patrickfrank1 commented 5 years ago

Hi,

I was wondering which Wikipedia corpus you used to train your sent2vec models. This is important to me, since I have not yet managed to reproduce your results, even when using your hyperparameters on the enwik9 corpus.

mpagli commented 5 years ago

We used a dump of Wikipedia from 2015, if I recall correctly. It was tokenized and split into sentences using Stanford NLP, and sentences that were too short or too long were discarded from the corpus. What kind of preprocessing did you do on your corpus?
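
For reference, here is a minimal sketch of the pipeline described above. It uses Stanza (the Python package from the Stanford NLP group) as a stand-in for whatever Stanford tooling was actually used in 2015, and the length thresholds are hypothetical, since the actual cutoffs are not stated in this thread:

```python
import stanza

# One-time model download: stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize")

# Hypothetical length bounds; the cutoffs used for the
# pretrained models are not given in this thread.
MIN_TOKENS, MAX_TOKENS = 3, 100

def preprocess(raw_text):
    """Split raw text into tokenized sentences and drop sentences
    that are too short or too long, as described above."""
    doc = nlp(raw_text)
    for sentence in doc.sentences:
        tokens = [token.text for token in sentence.tokens]
        if MIN_TOKENS <= len(tokens) <= MAX_TOKENS:
            yield " ".join(tokens)
```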

patrickfrank1 commented 5 years ago

Thanks for the clarification.

I used the enwik9 corpus for training, which should explain the performance gap, since that corpus is much smaller than your Wikipedia dump. I preprocessed it using Matt Mahoney's wikifil Perl script and also Stanford NLP.
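
For context, wikifil.pl (from Matt Mahoney's text8 tooling) strips the wiki markup and then reduces the text to lowercase a-z with digits spelled out, which is far more aggressive than the Stanford pipeline described above. A rough Python sketch of its character-level stage:

```python
import re

# Mahoney's script spells out digits before discarding non-letters.
DIGIT_WORDS = {
    "0": " zero ", "1": " one ", "2": " two ", "3": " three ", "4": " four ",
    "5": " five ", "6": " six ", "7": " seven ", "8": " eight ", "9": " nine ",
}

def wikifil_like(text: str) -> str:
    """Rough Python equivalent of the character-level stage of wikifil.pl:
    lowercase, spell out digits, keep only a-z and single spaces.
    (The real script also strips wiki markup, tables, and links first.)"""
    text = text.lower()
    for digit, word in DIGIT_WORDS.items():
        text = text.replace(digit, word)
    return re.sub(r"[^a-z]+", " ", text).strip()
```

This difference in preprocessing, on top of the difference in corpus size, could plausibly contribute to the gap as well.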