Closed: NickShahML closed this issue 7 years ago
Hi @NickShahML,
I used the Europarl v7 corpus, which you can download here: http://www.statmt.org/europarl/v7/fr-en.tgz. I used the Moses tokenizer (https://github.com/moses-smt/mosesdecoder) to tokenize the raw data.
The Europarl corpus has 1,964,110 lines. I could look into adding an automatic download and pre-processing script for Europarl.
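A minimal sketch of what such a download-and-verify script could look like (this is not the repo's actual script; the file names inside the archive are an assumption based on Europarl's usual layout, and the expected pair count comes from the number quoted above):

```python
"""Sketch: fetch the Europarl v7 fr-en corpus and sanity-check alignment.

Assumptions: the URL from this thread, and that the archive unpacks to
europarl-v7.fr-en.en / europarl-v7.fr-en.fr (Europarl's usual layout).
"""
import tarfile
import urllib.request


def count_parallel_lines(en_path, fr_path):
    """Count sentence pairs, insisting both sides have equal line counts."""
    with open(en_path, encoding="utf-8") as en_f, \
         open(fr_path, encoding="utf-8") as fr_f:
        n_en = sum(1 for _ in en_f)
        n_fr = sum(1 for _ in fr_f)
    if n_en != n_fr:
        raise ValueError(f"misaligned corpus: {n_en} en vs {n_fr} fr lines")
    return n_en


def download_europarl(dest="fr-en.tgz"):
    """Fetch and unpack the archive (a large download, several hundred MB)."""
    url = "http://www.statmt.org/europarl/v7/fr-en.tgz"
    urllib.request.urlretrieve(url, dest)
    with tarfile.open(dest) as tar:
        tar.extractall(".")


# Usage (real run; triggers the large download):
#   download_europarl()
#   n = count_parallel_lines("europarl-v7.fr-en.en", "europarl-v7.fr-en.fr")
#   # per this thread, n should be 1,964,110
```

Tokenization would then go through the Moses `tokenizer.perl` script from the mosesdecoder repo linked above, which this sketch deliberately leaves out.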
Hey @MaximumEntropy, no worries. I just wanted to see how big the corpus was. For some reason, I thought it was much larger than 2 million training examples. Appreciate the help, and I'll close this.
Hey @MaximumEntropy, thanks for such a nice, clean repo. I was wondering if there was a specific script you used to download the WMT data. Could you point us to what you used?
Also, do you mind sharing how many training examples there are in the WMT data? It looks like you have a train time of roughly 5 hours per epoch, and I was wondering how many training examples were in each epoch.