MaximumEntropy / Seq2Seq-PyTorch

Sequence to Sequence Models with PyTorch
Do What The F*ck You Want To Public License

Download Script for WMT Data? #2

Closed NickShahML closed 7 years ago

NickShahML commented 7 years ago

Hey @MaximumEntropy, thanks for such a nice, clean repo. I was wondering whether you used a specific script to download the WMT data. Maybe you can point us to what you used?

Also, do you mind sharing how many training examples there are in the WMT data? It looks like training takes ~5 hours per epoch, so I was wondering how many training examples each epoch covers.

MaximumEntropy commented 7 years ago

Hi @NickShahML,

I used the Europarl v7 corpus, which you can download here: http://www.statmt.org/europarl/v7/fr-en.tgz. I tokenized the raw data with the Moses tokenizer (https://github.com/moses-smt/mosesdecoder).
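For anyone who wants to reproduce this step, here is a minimal sketch of the download, extract, and tokenize flow in Python. It is not the script used for this repo. The helper names (`download`, `extract`, `moses_tokenize_cmd`, `tokenize_file`) and the local directory layout are assumptions made for illustration; the only things taken from the comment above are the Europarl URL and the use of the Moses tokenizer, whose `tokenizer.perl` script reads from stdin, writes to stdout, and takes the language via `-l`.

```python
import subprocess
import tarfile
import urllib.request
from pathlib import Path

# URL quoted in the comment above.
EUROPARL_URL = "http://www.statmt.org/europarl/v7/fr-en.tgz"


def download(url, dest):
    """Download url to dest, skipping the request if dest already exists."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive, out_dir):
    """Extract a gzipped tarball and return the sorted names of its files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(m.name for m in tar.getmembers() if m.isfile())
        tar.extractall(out_dir)
    return names


def moses_tokenize_cmd(moses_dir, lang):
    """Build the command line for Moses' tokenizer.perl (stdin -> stdout)."""
    script = Path(moses_dir) / "scripts" / "tokenizer" / "tokenizer.perl"
    return ["perl", str(script), "-l", lang]


def tokenize_file(moses_dir, lang, src, dst):
    """Pipe src through the Moses tokenizer and write the result to dst."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        subprocess.run(moses_tokenize_cmd(moses_dir, lang),
                       stdin=fin, stdout=fout, check=True)


# Example usage (hypothetical paths; requires network access, perl,
# and a local clone of mosesdecoder):
#   archive = download(EUROPARL_URL, "data/fr-en.tgz")
#   for name in extract(archive, "data/europarl"):
#       lang = name.rsplit(".", 1)[-1]          # assumes names end in .en / .fr
#       tokenize_file("mosesdecoder", lang,
#                     Path("data/europarl") / name,
#                     Path("data/europarl") / (name + ".tok"))
```

The usage lines are left commented out because they depend on the network and on a Moses checkout; the assumption that the extracted file names end in the language code should be checked against the actual archive contents.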

The Europarl corpus has 1,964,110 lines. I guess I could look into adding an automatic download and preprocessing script for Europarl.
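To check the corpus size on your own copy, a small sketch like the following counts the lines in both sides of a parallel corpus and verifies that they match. The function names and the file paths in the usage comment are hypothetical; the count you get depends on which files and filtering you use, so it is not guaranteed to equal the figure quoted above.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file by reading it in binary chunks.

    A final line without a trailing newline is not counted.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def count_parallel(src_path, tgt_path):
    """Return the number of sentence pairs, or raise if the sides differ."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            "line count mismatch: %d in %s vs %d in %s"
            % (n_src, src_path, n_tgt, tgt_path)
        )
    return n_src


# Example usage (hypothetical file names):
#   n = count_parallel("data/europarl/corpus.en", "data/europarl/corpus.fr")
#   print(n, "sentence pairs per epoch")
```

Reading in binary chunks avoids decoding roughly two million lines of text just to count them, and the mismatch check catches a misaligned corpus before training starts.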

NickShahML commented 7 years ago

Hey @MaximumEntropy, no worries. I just wanted to see how big the corpus was. For some reason, I thought it was much larger than 2 million training examples. I appreciate the help, and I'll close this.