facebookresearch / fairseq-lua

Facebook AI Research Sequence-to-Sequence Toolkit

The Corpus in WMT14 English - French #59

Closed StillKeepTry closed 6 years ago

StillKeepTry commented 7 years ago

The WMT14 data includes the following corpora:

  1. commoncrawl
  2. europarl-v7
  3. giga
  4. news-commentary
  5. undoc

Together, these corpora contain 40.8M sentences, which differs from the 36M reported in the paper. Are there any other data-processing details?

jgehring commented 7 years ago

Yes, please check the README.md file that comes with the pre-trained En-Fr model that was added recently. Here's the relevant portion:

Setup notes

  • Data is from http://statmt.org/wmt14/translation-task.html#Download
  • Pre-processing with scripts from https://github.com/moses-smt/mosesdecoder
    • ./scripts/tokenizer/normalize-punctuation.perl
    • ./scripts/tokenizer/remove-non-printing-char.perl
    • ./scripts/tokenizer/tokenizer.perl
  • Remove all sentences with more than 175 words
  • Remove all sentences with a source/target length ratio exceeding 1.5
  • Data is split into train and valid:
    • train contains 35482654 sentences
    • valid contains 26663 sentences
  • We use a BPE sub-word vocabulary with 40k tokens, trained jointly on the training data for both languages (see bpecodes)
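For reference, a rough shell sketch of the pipeline described above; the file names, the tokenizer flags, and the use of clean-corpus-n.perl for the length/ratio filtering are my assumptions, not necessarily what was used for the released model:

```
# Tokenization with the Moses scripts; $MOSES points at a mosesdecoder checkout.
MOSES=~/mosesdecoder
for l in en fr; do
  cat corpus.$l \
    | perl $MOSES/scripts/tokenizer/normalize-punctuation.perl $l \
    | perl $MOSES/scripts/tokenizer/remove-non-printing-char.perl \
    | perl $MOSES/scripts/tokenizer/tokenizer.perl -threads 8 -l $l \
    > corpus.tok.$l
done

# Length filtering: drop pairs longer than 175 words or with a
# source/target length ratio above 1.5 (one way to do it, not necessarily the original one).
perl $MOSES/scripts/training/clean-corpus-n.perl -ratio 1.5 corpus.tok en fr corpus.clean 1 175
```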
Zrachel commented 7 years ago

Thx @jgehring,

Do I have to apply "Remove all sentences with more than 175 words" and "Remove all sentences with a source/target length ratio exceeding 1.5" both before and after BPE?

jgehring commented 7 years ago

BPE encoding is the very last step, and the encoding is trained on the pre-processed and length-filtered training split.
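The order is therefore: tokenize, length-filter, split, and only then learn and apply BPE. As a sketch using subword-nmt (https://github.com/rsennrich/subword-nmt), assuming that tool choice and illustrative file names for the already filtered and split data (the released bpecodes may have been produced with different settings):

```
# Learn a joint 40k BPE vocabulary on the filtered training split only,
# then apply it to train, valid and test for both languages.
cat train.en train.fr > train.en-fr
subword-nmt learn-bpe -s 40000 < train.en-fr > bpecodes
for l in en fr; do
  for p in train valid test; do
    subword-nmt apply-bpe -c bpecodes < $p.$l > $p.bpe.$l
  done
done
```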

Zrachel commented 7 years ago

Thx. According to the README.md for the pre-trained WMT En-De model, does "No further pre-processing" mean you only used ./scripts/tokenizer/normalize-punctuation.perl, ./scripts/tokenizer/remove-non-printing-char.perl, and ./scripts/tokenizer/tokenizer.perl?

michaelauli commented 7 years ago

We used the pre-processed data from https://nlp.stanford.edu/projects/nmt/ for WMT14 English-German in order to be comparable with previous work that used the same version of the data. Unfortunately, we do not have access to the exact procedure that was used to pre-process it. To be sure, you can check with the authors of the page above.

Zrachel commented 7 years ago

Thank you @michaelauli, I had missed the data source in the README.md. With the Stanford data, I process the data as follows:

  1. paste train.en train.de > train.en-de
  2. shuffle train.en-de, with head -44688 > valid.data, tail -n +44689 > train.data
  3. learn_bpe from train.data
  4. apply_bpe to train.data, valid.data and test (newstest2014)
  5. fairseq preprocess
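In shell form, steps 1 and 2 look roughly like this (shuf and cut are just one way to do the joint shuffle and split; file names are illustrative):

```
paste train.en train.de > train.en-de            # pair the two sides line by line
shuf train.en-de > train.en-de.shuf              # shuffle the pairs jointly
head -n 44688  train.en-de.shuf > valid.en-de    # first 44688 pairs -> validation
tail -n +44689 train.en-de.shuf > train.en-de.rest
cut -f1 valid.en-de > valid.en; cut -f2 valid.en-de > valid.de
cut -f1 train.en-de.rest > train.shuf.en; cut -f2 train.en-de.rest > train.shuf.de
```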

There seems to be no difference from your setup, but while the pre-trained model's dictionaries have

  | [en] Dictionary: 42243 types
  | [de] Dictionary: 43676 types

I get

  | [en] Dictionary: 36645 types
  | [en] temp/subword/train.en: 4424152 sents, 130458910 tokens, 0.01% replaced by <unk>
  | [en] temp/subword/valid.en: 44688 sents, 1316445 tokens, 0.01% replaced by <unk>
  | [en] temp/subword/test.en: 2737 sents, 73951 tokens, 0.02% replaced by <unk>
  | [en] Wrote preprocessed data to output
  | [de] Dictionary: 40973 types
  | [de] temp/subword/train.de: 4424152 sents, 133356636 tokens, 0.00% replaced by <unk>
  | [de] temp/subword/valid.de: 44688 sents, 1349279 tokens, 0.00% replaced by <unk>
  | [de] temp/subword/test.de: 2737 sents, 77538 tokens, 0.00% replaced by <unk>
  | [de] Wrote preprocessed data to output

Is there anything else I need to pay attention to?

jgehring commented 7 years ago

How did you call fairseq preprocess? As you're limiting the vocabulary with BPE already, you should not specify any dictionary cutoff for fairseq preprocess.
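For example, along the lines of the IWSLT example in the repository README (prefixes and the destination directory are illustrative):

```
# No -thresholdsrc/-thresholdtgt: BPE already limits the vocabulary.
fairseq preprocess -sourcelang en -targetlang de \
  -trainpref temp/subword/train -validpref temp/subword/valid -testpref temp/subword/test \
  -destdir output
```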

Zrachel commented 7 years ago

You hit the nail on the head. After removing "-thresholdsrc 3 -thresholdtgt 3", I get

  | [en] Dictionary: 41953 types
  | [en] temp/subword/train.en: 4424152 sents, 130458910 tokens, 0.00% replaced by <unk>
  | [en] temp/subword/valid.en: 44688 sents, 1316445 tokens, 0.00% replaced by <unk>
  | [en] temp/subword/test.en: 2737 sents, 73951 tokens, 0.00% replaced by <unk>
  | [en] Wrote preprocessed data to output
  | [de] Dictionary: 43309 types
  | [de] temp/subword/train.de: 4424152 sents, 133356636 tokens, 0.00% replaced by <unk>
  | [de] temp/subword/valid.de: 44688 sents, 1349279 tokens, 0.00% replaced by <unk>
  | [de] temp/subword/test.de: 2737 sents, 77538 tokens, 0.00% replaced by <unk>
  | [de] Wrote preprocessed data to output

Thank you :)

Zrachel commented 6 years ago

Another problem still makes my WMT14 En-Fr training fail. Can you help me check my log on Google Drive? Thank you.

https://drive.google.com/open?id=1hbk9pCsMsiZP1eqcZ6fV7Ee-Vp0S2HeO

I've tried three times. Is it out of memory or something else?

michaelauli commented 6 years ago

Did you check that none of the sentences exceeds 1024 words? The error looks like the position lookup table is out of bounds.
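A quick way to check, assuming whitespace-tokenized (BPE-encoded) files with illustrative names:

```
# Print the longest sentence length (in tokens) on each side; anything longer
# than the hard-coded maximum will overflow the position lookup table.
for f in train.bpe.en train.bpe.fr; do
  echo -n "$f: "
  awk 'NF > max { max = NF } END { print max }' "$f"
done
```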

Zrachel commented 6 years ago

Thank you. I have now tried clean-corpus-n.perl -ratio 1.5 temp/train.subword en fr temp/subword/train 1 400 after BPE; I'll wait for the result.

But I'm confused: why does a long input sequence make the lookup table go out of bounds?

jgehring commented 6 years ago

This is simply a hard-coded constant in the code. You can change it to something larger if needed here: https://github.com/facebookresearch/fairseq/blob/master/fairseq/models/fconv_model.lua#L177

Zrachel commented 6 years ago

Ah, it's the position encoding lookup table, thank you.

Psycoy commented 2 years ago

Hey, it's actually still not answered why Attention Is All You Need (https://arxiv.org/abs/1706.03762) reports a training set of 36M sentence pairs, while the actual size (before preprocessing) is 40.8M. Would anyone kindly help explain that?

Psycoy commented 2 years ago

@Psycoy PS: Most papers claim that the WMT2014 ENDE training set has 4.5M sentence pairs and that WMT2014 ENFR has 36M, while the actual sizes of the downloaded training sets are 4.5M (3.9M after preprocessing) for WMT2014 ENDE and 40.8M (35.8M after preprocessing) for WMT2014 ENFR, respectively.

It seems that the size of WMT2014 ENDE before preprocessing and the size of WMT2014 ENFR after preprocessing are the numbers claimed by most papers, which is strange, because papers should report the sizes under the same conditions (either both before preprocessing or both after). Here "preprocessing" means removing the samples that exceed a certain length, which makes the set smaller.

Here are the corpora I used for the two training sets (WMT2014 ENDE and WMT2014 ENFR), which are the same as on the WMT2014 homepage:

WMT2014 ENDE:

  1. Europarl v7
  2. Common Crawl corpus
  3. News Commentary

WMT2014 ENFR:

  1. Europarl v7
  2. Common Crawl corpus
  3. UN corpus
  4. News Commentary
  5. giga French-English corpus

Is there any problem with my setup? Can anyone explain the discrepancy?