Yes, please check the README.md file that was recently added with the pre-trained En-Fr models. Here's the relevant portion:
Setup notes
- Data is from http://statmt.org/wmt14/translation-task.html#Download
- Pre-processing with scripts from https://github.com/moses-smt/mosesdecoder
  - ./scripts/tokenizer/normalize-punctuation.perl
  - ./scripts/tokenizer/remove-non-printing-char.perl
  - ./scripts/tokenizer/tokenizer.perl
- Remove all sentences with more than 175 words
- Remove all sentences with a source/target length ratio exceeding 1.5
- Data is split into train and valid:
  - train contains 35482654 sentences
  - valid contains 26663 sentences
- We use a BPE sub-word vocabulary with 40k tokens, trained jointly on the training data for both languages (see bpecodes)
Thanks @jgehring,
Do I have to "Remove all sentences with more than 175 words" and "Remove all sentences with a source/target length ratio exceeding 1.5" both before and after BPE?
BPE encoding is the very last step, and the encoding is trained on the pre-processed and length-filtered training split.
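To make that ordering concrete, here is a minimal sketch of the pipeline, assuming the Moses scripts plus subword-nmt for the BPE step; the file names, thread counts, and the exact BPE tooling are illustrative assumptions, not necessarily what was used for the released model:

```bash
# Tokenize each side (illustrative paths; mosesdecoder/ is a checkout of
# https://github.com/moses-smt/mosesdecoder).
for l in en fr; do
  cat train.$l \
    | perl mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $l \
    | perl mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl \
    | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l $l -threads 8 \
    > train.tok.$l
done

# Length filtering happens before BPE: drop sentences with more than 175 words
# and pairs whose length ratio exceeds 1.5.
perl mosesdecoder/scripts/training/clean-corpus-n.perl \
  -ratio 1.5 train.tok en fr train.filtered 1 175

# BPE is the very last step, trained jointly on both languages (40k merges,
# assuming subword-nmt; the release may have used different tooling).
cat train.filtered.en train.filtered.fr | subword-nmt learn-bpe -s 40000 -o bpecodes
for l in en fr; do
  subword-nmt apply-bpe -c bpecodes < train.filtered.$l > train.bpe.$l
done
```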
Thanks. According to the README.md for the pre-trained WMT En-De model, does "No further pre-processing" mean that only ./scripts/tokenizer/normalize-punctuation.perl, ./scripts/tokenizer/remove-non-printing-char.perl and ./scripts/tokenizer/tokenizer.perl were used?
We used the pre-processed data from https://nlp.stanford.edu/projects/nmt/ for WMT14 English-German to be comparable with previous work that used the same version of the data. Unfortunately, we do not have access to the exact procedure used to pre-process that data. To be sure, you can check with the authors of the page above.
Thank you @michaelauli, I'd missed the data source in the README.md. With the Stanford data, I process the data with:
It seems no different from yours, but the pre-trained dictionary has
| [en] Dictionary: 42243 types
| [de] Dictionary: 43676 types
whereas I get
| [en] Dictionary: 36645 types
| [en] temp/subword/train.en: 4424152 sents, 130458910 tokens, 0.01% replaced by <unk>
Is there anything else I need to pay attention to?
How did you call `fairseq preprocess`? As you're limiting the vocabulary with BPE already, you should not specify any dictionary cutoff for `fairseq preprocess`.
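For reference, a minimal sketch of such a call without the frequency thresholds, assuming the Lua fairseq CLI; the language pair and paths below are illustrative, so check `fairseq preprocess -help` for the exact options available in your checkout:

```bash
# Binarize the BPE'd data without any frequency cutoff: with a BPE vocabulary
# every sub-word should stay in the dictionary, so -thresholdsrc/-thresholdtgt
# are simply left out.
fairseq preprocess \
  -sourcelang en -targetlang fr \
  -trainpref temp/subword/train \
  -validpref temp/subword/valid \
  -destdir data-bin/wmt14.en-fr
```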
You hit the nail on the head. After removing "-thresholdsrc 3 -thresholdtgt 3", I get
| [en] Dictionary: 41953 types
| [en] temp/subword/train.en: 4424152 sents, 130458910 tokens, 0.00% replaced by <unk>
Thank you :)
Another problem still makes training on WMT14 En-Fr fail for me. Can you help me check my log on Google Drive? Thank you.
https://drive.google.com/open?id=1hbk9pCsMsiZP1eqcZ6fV7Ee-Vp0S2HeO
I've tried three times. Is it out of memory or something else?
Did you check that none of the sentences exceeds 1024 words? The error looks like the position lookup table is out of bounds.
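A quick way to check, assuming one sentence per line and whitespace-separated tokens (the paths follow the log lines above and may differ on your side):

```bash
# Print the longest line length, in whitespace-separated tokens, for each side.
for f in temp/subword/train.en temp/subword/train.fr; do
  awk -v name="$f" '{ if (NF > max) max = NF } END { print name, max }' "$f"
done
```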
Thank you. I've now run clean-corpus-n.perl -ratio 1.5 temp/train.subword en fr temp/subword/train 1 400 after BPE; I'll wait for the result.
But I'm still confused: WHY does a long input lead to the lookup table going out of bounds?
The maximum length is simply a hard-coded constant in the code: the position embeddings are stored in a lookup table with one entry per position, so a position index beyond that constant has no entry and the lookup fails. You can change it to something larger if needed here: https://github.com/facebookresearch/fairseq/blob/master/fairseq/models/fconv_model.lua#L177
Ah, it's the position encoding lookup table, thank you.
Hey, it's actually still not answered why "Attention Is All You Need" (https://arxiv.org/abs/1706.03762) reports the training set as 36M sentence pairs, while the actual size (before preprocessing) is 40.8M. Would anyone kindly help explain that?
@Psycoy PS: Most papers claim that the WMT2014 En-De training set has 4.5M sentence pairs and that WMT2014 En-Fr has 36M, while the actual sizes of the downloaded training sets are 4.5M (3.9M after preprocessing) for WMT2014 En-De and 40.8M (35.8M after preprocessing) for WMT2014 En-Fr, respectively.
It seems that most papers report the size of WMT2014 En-De before preprocessing but the size of WMT2014 En-Fr after preprocessing, which is strange, because the sizes should be reported under the same conditions (both unpreprocessed or both preprocessed). Here "preprocessing" means removing sentence pairs that exceed a certain length, which shrinks the set.
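As a concrete sanity check, comparing line counts before and after the length filtering makes the shrinkage visible; the file names below are illustrative, and the counts are the rounded ones I observed:

```bash
# Both sides of a parallel corpus must have identical line counts;
# the difference between the two pairs of counts is what the filtering removed.
wc -l corpus.tok.en corpus.tok.fr            # ~40.8M lines each, before filtering
wc -l corpus.filtered.en corpus.filtered.fr  # ~35.8M lines each, after filtering
```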
Here are the corpora I used for the two training sets (WMT2014 En-De and WMT2014 En-Fr), which are also the same as on the WMT2014 homepage:
WMT2014 En-De:
WMT2014 En-Fr:
Any problems? Can anyone explain it?
The WMT14 En-Fr training data includes the following corpora:
Together these corpora contain 40.8M sentence pairs, which differs from the 36M reported in the paper. Are there any other data-processing details?