facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

issues with mbart models #1758

Closed · mjpost closed this issue 2 years ago

mjpost commented 4 years ago

❓ Questions and Help

Thanks for releasing the mbart models! However, we are unable to reproduce the EN-RO fine-tuned BLEU scores reported in the paper. We get a BLEU score of 26.9 using sacreBLEU's default tokenization (13a). This is well below the 38.5 reported in the README and even below scores reported for WMT16. Here is a complete script to reproduce this; is there anything obvious we are doing wrong?

We have also tried scoring with the main, pretrained-only model, and were surprised to find that the parameter names seem to change between the main model and the fine-tuned one. Perhaps documenting this is beyond the scope of the release, but it is a bit confusing when working with these models.
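
For reference, a quick way to see the difference is to diff the parameter names in the two checkpoints (a minimal sketch; the paths are hypothetical and just point at wherever the downloaded models were extracted):

import torch

# Sketch: compare parameter names between the pretrained and fine-tuned checkpoints.
# The paths are hypothetical; point them at the extracted mbart.cc25 and EN-RO downloads.
pretrained = torch.load("cc25_pretrain/model.pt", map_location="cpu")["model"]
finetuned = torch.load("MBART_finetuned_enro/model.pt", map_location="cpu")["model"]

print("only in pretrained:", sorted(set(pretrained) - set(finetuned)))
print("only in fine-tuned:", sorted(set(finetuned) - set(pretrained)))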

Code

Here is the code we run:

# constants
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
MODELDIR=MBART_finetuned_enro
DICT=$MODELDIR/dict.txt
export FAIRSEQ=~/code/fairseq
export PYTHONPATH=$FAIRSEQ
# end constants

SRC=en_XX
TRG=ro_RO

mkdir tmp
sacrebleu -t wmt16 -l en-ro --echo src | spm_encode --model $MODELDIR/sentence.bpe.model > tmp/data.spm.$SRC
sacrebleu -t wmt16 -l en-ro --echo ref | spm_encode --model $MODELDIR/sentence.bpe.model > tmp/data.spm.$TRG

python3 $FAIRSEQ/preprocess.py \
  --source-lang $SRC \
  --target-lang $TRG \
  --testpref tmp/data.spm  \
  --destdir tmp \
  --thresholdtgt 0 \
  --thresholdsrc 0 \
  --srcdict ${DICT} \
  --tgtdict ${DICT} \
  --workers 70

python3 $FAIRSEQ/generate.py tmp \
  --path $MODELDIR/model.pt \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  --max-tokens 1000 \
  -s $SRC \
  -t $TRG \
  --max-sentences 32 \
  --langs $langs > out.wmt19.ro

grep ^H out.wmt19.ro | sort -V | cut -f3 | spm_decode --model $MODELDIR/sentence.bpe.model | perl -pe 's/\[ro_RO\]//' | sacrebleu -t wmt16 -l en-ro -b


yinhanliu commented 4 years ago

@mjpost Please use the tokenizer in the README. Our output is not BPEed, so you need to run the output through that preprocessing and tokenize it before scoring.

mjpost commented 4 years ago

Thanks for the response. The README says "set tokenizer here". I presume this means to apply the remove-diacritics.py and normalise-romanian.py scripts in that repo. I will also guess that I should run the Moses tokenizer with the -l ro flag. Doing so:

$ cat out.debpe | ~/code/wmt16-scripts/preprocess/normalise-romanian.py | ~/code/wmt16-scripts/preprocess/remove-diacritics.py | ~/code/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ro > out.tok
$ sacrebleu -t wmt16 -l en-ro --echo ref | ~/code/wmt16-scripts/preprocess/normalise-romanian.py | ~/code/wmt16-scripts/preprocess/remove-diacritics.py | ~/code/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ro > ref.tok
$ sacrebleu -tok none -s none -b

gives 37.1, which is at least much closer to what's reported in the README.

yinhanliu commented 4 years ago

@mjpost Very close now, but you are still missing three steps before the normalization. See below:

lg=ro
$REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $NORMALIZE_ROMANIAN | $REMOVE_DIACRITICS | $TOKENIZER -no-escape -threads $N_THREADS -l $lg

yinhanliu commented 4 years ago

@mjpost Please let me know whether you can reproduce the number so that I can close this issue.

mjpost commented 4 years ago

I didn't expect that this preprocessing would make such a difference.

Running your exact command gives me 37.8:

#!/bin/bash

set -eu

REPLACE_UNICODE_PUNCT=$HOME/code/mosesdecoder/scripts/tokenizer/replace-unicode-punctuation.perl
NORM_PUNC=$HOME/code/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$HOME/code/mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl
REMOVE_DIACRITICS=$HOME/code/wmt16-scripts/preprocess/remove-diacritics.py
NORMALIZE_ROMANIAN=$HOME/code/wmt16-scripts/preprocess/normalise-romanian.py
TOKENIZER=$HOME/code/mosesdecoder/scripts/tokenizer/tokenizer.perl

sys=$1
ref=$2

lang=ro
for file in $sys $ref; do
  cat $file \
  | $REPLACE_UNICODE_PUNCT \
  | $NORM_PUNC -l $lang \
  | $REM_NON_PRINT_CHAR \
  | $NORMALIZE_ROMANIAN \
  | $REMOVE_DIACRITICS \
  | $TOKENIZER -no-escape -l $lang \
  > $(basename $file).tok
done

cat $(basename $sys).tok | sacrebleu -tok none -s none -b $(basename $ref).tok

I run this as:

# ./eval-enro.sh out.clean.wmt19.ro.txt wmt19.en-ro.ro.txt
37.8

with the attached files (the first one being the output).

wmt19.en-ro.ro.txt out.clean.wmt19.ro.txt

sshleifer commented 4 years ago

@mjpost This is very helpful!

I am having trouble following what the final 37.8 solution ended up being.

Is out.clean.wmt19.ro.txt generated from your original script or an intermediate result? Thanks!

mjpost commented 4 years ago

@sshleifer I believe it was generated from the fairseq model (so cat fairseq.out | grep ^H | cut -f3 | spm_decode).

I never reached their reported score but it was close.

sshleifer commented 4 years ago

Thanks! Did you ever get to the bottom of the difference between cc25 and en_ro models?

mjpost commented 4 years ago

No—we had thought it would be an easy comparison as a baseline in a project we're working on, but I couldn't figure it out after putting some time into it. They didn't respond to the point in my second paragraph above.

myleott commented 4 years ago

This seems not quite resolved, particularly: "the names of the parameters seem to change between the main model and fine-tuned one".

@yinhanliu or @MultiPath, can you share any insight on why the weights change between cc25 and en-ro?

mjpost commented 4 years ago

There are details of the error I ran into and how to reproduce it in #1754.

yinhanliu commented 4 years ago

https://github.com/pytorch/fairseq/blob/18831f9f8353e7b7902f4d9a651463f50f40ce3f/fairseq/models/bart/model.py#L248 this needs to be True to solve #1754

Sorry, I don't quite understand what you tried to do and what failed.

You tried to run generation on the pre-trained model? The pre-trained model is a de-noising model; it will copy src to tgt, since it never learned translation.

mjpost commented 4 years ago

Yes, I understand it's just a de-noiser. But I should at least be able to run it. I gave an EN-DE example in #1754, but if I switch to EN-EN, it still fails.

How do I tell the model that args.layernorm_embedding is true? This isn't a command-line argument but appears to be an internal API model creation parameter. Why is this not stored in the model config itself?

(A similar source of confusion comes from having to pass the list of language codes, instead of just adding these to the model dictionary.)
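
For what it's worth, the only workaround I can think of is to patch the flag into the checkpoint itself; an untested sketch, assuming the checkpoint stores its training args under the "args" key (as fairseq checkpoints from this era seem to):

import torch

# Untested sketch: force layernorm_embedding=True in the stored training args,
# assuming the checkpoint keeps an argparse Namespace under the "args" key.
d = torch.load("cc25_pretrain/model.pt", map_location="cpu")
if d.get("args") is not None:
    setattr(d["args"], "layernorm_embedding", True)
    torch.save(d, "cc25_pretrain/model_layernorm.pt")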

yinhanliu commented 4 years ago

https://github.com/pytorch/fairseq/blob/7c0ab23d14882d77ae5017ee71085925c5c03373/fairseq/models/transformer.py#L161

Just passing the arg should be fine. The fine-tuning command in the README gives better information on how to use the pre-trained mBART. The generate command is designed for the translation (fine-tuned) model only.

mjpost commented 4 years ago

That is an argument for training. I am not trying to fine-tune the pretrained model (that works); instead, I am trying to use the pretrained model with fairseq-generate, which doesn't have a --layernorm-embedding flag.

mjpost commented 4 years ago

The generate command is designed for the translation (fine-tuned) model only.

This was my original question. In principle there is no reason that I should not be able to decode with the pretrained model. My original question was to ask what parameters changed between the pretrained and fine-tuned models (i.e., what prevents the pretrained model from being used in decoding)?

mjpost commented 4 years ago

Here is an example where I am trying to use the pre-trained model (not fine-tuned) to find the auto-encoder score of an English input using `fairseq-generate`. I can get it to work with the "translation" task. However, I am not sure that I have the correct results.

First, I have to clean the cc25_pretrain model. After rereading the paper, it seems the extra parameters are likely those mentioned in the "Architecture" paragraph of Section 2:

We also include an additional layer-normalization layer on top of both the encoder and the decoder, which we found stabilized training at FP16 precision.

It is likely that the fine-tuning is just dropping these. It is easy to remove them with the following script.

import torch
d = torch.load("cc25_pretrain/model.pt")
for extra in ["encoder.layernorm_embedding.bias", "decoder.layernorm_embedding.weight", "decoder.layernorm_embedding.bias"]:
    if extra in d["model"]:
        dell d["model"][extra]
torch.save(d, "cc25_pretrain/model.pt")

Next, I manually add the language codes to the dictionary:

for code in [ar_AR] [cs_CZ] [de_DE] [en_XX] [es_XX] [et_EE] [fi_FI] [fr_XX] [gu_IN] [hi_IN] [it_IT] [ja_XX] [kk_KZ] [ko_KR] [lt_LT] [lv_LV] [my_MM] [ne_NP] [nl_XX] [ro_RO] [ru_RU] [si_LK] [tr_TR] [vi_VN] [zh_CN] "<mask>"; do
    echo "$code 1" >> cc25_pretrain/dict.txt
done

Now, the model will work with the "translation" task with fairseq-generate.

Suppose I would like to use the model to find the auto-encoder score for an English sentence. It seems one has to append the language code to the end of the source and to the start of the target, like this:

S-8     ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke . [en_XX]
T-8     [en_XX] ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
H-8     -29.304781542170865     [en_XX] ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
P-8     -45.5193 -36.3480 -17.6285 -33.7401 -18.9918 -41.1282 -45.5779 -44.3245 -14.6398 -14.6273 -33.6461 -22.1626 -12.6281

The probabilities here are very low. The vocabulary is quite large, but I would have expected the decoder prediction of the first token [en_XX] to be very high. Removing the language codes entirely produces much higher sentence-level scores:

S-8     ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
T-8     ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
H-8     -13.532117797144407     ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
P-8     -30.6400 -2.8578 -18.5653 -10.2856 -16.6289 -16.3278 -17.9757 -5.5656 -3.2346 -20.1281 -9.9263 -10.2497

But it is hard to know what the right call is here.

Perhaps the language codes are used as the actual <bos> and <eos> tokens, in which case one would have to adapt the translation_from_pretrained_bart task. I have played around with this quite a bit (one has to take care, since that task does not permit auto-encoder scoring out of the box, but assumes you are using two distinct languages). I can do this work, but it would be very helpful to have some technical guidance here on exactly what the model expects. @yinhanliu, perhaps this would be an easy question for you to answer?
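
As a small sanity check on my side (sketch only; the path is the augmented dictionary from above), I have been verifying where the language tags that the task indexes actually sit:

from fairseq.data import Dictionary

# Sketch: load the augmented mBART dictionary and look up the language tags,
# since translation_from_pretrained_bart resolves them via dict.index('[lang]').
d = Dictionary.load("cc25_pretrain/dict.txt")
print(len(d), d.index("[en_XX]"), d.index("[ro_RO]"))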

mjpost commented 4 years ago

Looking at translation_from_pretrained_bart, it seems that adding the source LID token to the end of the source sentence is correct, but that what I need is to use the target-side language code as the decoder BOS:

eos = None
if append_source_id:
    src_dataset = AppendTokenDataset(src_dataset, src_dict.index('[{}]'.format(src)))
    if tgt_dataset is not None:
        tgt_dataset = AppendTokenDataset(tgt_dataset, tgt_dict.index('[{}]'.format(tgt)))
    eos = tgt_dict.index('[{}]'.format(tgt))

I presume that appending here causes the target-side LID token to serve as EOS, and that I have to set it to BOS, too, so that the decoder context is properly initialized.
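
To make that concrete, here is a toy sketch (not fairseq code; the tokens are just copied from the example above) of my reading of the collater: the EOS of the target is rotated to the front to build the decoder input, so the language tag would serve as both the EOS and the initial decoder token.

# Toy illustration (not fairseq code): with the language tag appended as EOS,
# rotating the EOS to the front of the target yields the decoder input, so the
# tag is also the first token the decoder conditions on.
target = ["▁Other", "s", "▁have", "▁dis", "miss", "ed", "▁him", "▁as", "▁a", "▁joke", ".", "[en_XX]"]
prev_output_tokens = [target[-1]] + target[:-1]
print(prev_output_tokens)
# ['[en_XX]', '▁Other', 's', '▁have', '▁dis', 'miss', 'ed', '▁him', '▁as', '▁a', '▁joke', '.']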

moussaKam commented 4 years ago

Hi @mjpost I'm trying to reproduce the work you have done to find the auto-encoder score.

I deleted the layernorm embedding layers:

import torch
d = torch.load("bart/cc25_pretrain/mbart.cc25/model.pt")
for extra in ["encoder.layernorm_embedding.weight", "encoder.layernorm_embedding.bias", "decoder.layernorm_embedding.weight", "decoder.layernorm_embedding.bias"]:
    if extra in d["model"]:
        del d["model"][extra]
torch.save(d, "bart/cc25_pretrain/mbart.cc25/model_no_layernorm_embedding.pt")

Then I updated the dictionary:

for code in [ar_AR] [cs_CZ] [de_DE] [en_XX] [es_XX] [et_EE] [fi_FI] [fr_XX] [gu_IN] [hi_IN] [it_IT] [ja_XX] [kk_KZ] [ko_KR] [lt_LT] [lv_LV] [my_MM] [ne_NP] [nl_XX] [ro_RO] [ru_RU] [si_LK] [tr_TR] [vi_VN] [zh_CN] "<mask>"; do
    echo "$code 1" >> cc25_pretrain/dict.txt
done

Then I pre-processed some English sentences:

fairseq-preprocess --trainpref sample_text --srcdict dict.txt --tgtdict dict.txt --destdir data-sample --source-lang source --target-lang target

Finally I translated using the pre-trained model:

fairseq-generate data-sample  --path $model  --task translation --gen-subset train -t target -s source --bpe 'sentencepiece' --sentencepiece-vocab ../sentence.bpe.model

However I am getting very weird output, for example:

S-0 How are you.[en_XX]
T-0 [en_XX] How are you.
H-0 -0.1974826455116272 ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa
D-0 -0.1974826455116272 coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa
P-0 -20.5911 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -19.0969

and:

S-1 What is your name
T-1 What is your name
H-1 -0.16524094343185425    ▁Home
D-1 -0.16524094343185425    Home
P-1 -0.0007 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -22.5516 -10.6590

Am I missing something?

mjpost commented 4 years ago

@moussaKam I haven't had time to look into this deeply enough to solve it, so this may be incorrect. But I believe the problem is that MBART models use the target-language token both as BOS and EOS for the decoder. That is, the decoder is initialized with the language tag as BOS, and terminates when it generates that tag. This creates the following set of problems:

  • You cannot actually set BOS. You can prefix decode to a token (i.e., force it to predict the target language token as the first target-side token), but not set it to BOS.
  • Things are complicated because fairseq uses the values of -s and -t both as indices into the prepacked data and, optionally, to initialize BOS and EOS. But if you are trying to paraphrase, as your example suggests, you have to use a dummy language code for the source sentence, since you need en_XX for the target.
  • I am also uncertain about using the unadapted model in this way. They train with an additional layernorm layer (which you've deleted above). But without adapting the model, it's not clear what effect removing this will have, since the pretraining made use of it.

In short, I think you have to modify the code to be able to use the model in this way, and it's not certain that, once the technical barriers are cleared, the model will do what you are hoping. But of course, that is what I was trying to test empirically (and the reason for this issue).

vince62s commented 4 years ago

@mjpost @yinhanliu @sshleifer I understand you guys are trying to replicate the score from the paper, but the paper's number is not comparable to Sennrich's WMT16 results for EN-RO. The 37.8 is computed on tokenized output and reference, and on top of that there is some normalization / preprocessing of the reference, which is not what is done for WMT.

In the mBART paper it says, for RO, "We apply Moses tokenization and special normalization for Romanian texts" following Sennrich (2016a), but this seems wrong to me.

Bottom line, @mjpost, your 26.9 is the WMT-comparable score, unless you prefer to normalize / remove diacritics in your output, but not in the reference.

EDIT: In the introduction of the paper it says: "These results further improve with backtranslation (BT), setting a new state-of-the-art on WMT16 English-Romanian and the FloRes test sets." Later at the end of section 3.2: "Moreover, combining BT leads to additional gains, resulting in a new state-of-the-art for Ro-En translation."

Ro-En, if really scored with detokenized sacreBLEU, might be SOTA. En-Ro, I doubt it (cf. my comments above).
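
To make the distinction concrete, here is a small sketch using sacrebleu's Python API (made-up sentences; the only point is the tokenizer setting):

import sacrebleu

# Made-up output and reference, just to contrast the two scoring modes in this thread.
sys = ["Acesta este un exemplu."]
refs = [["Acesta este un exemplu."]]

# WMT-comparable: detokenized output, untouched reference, sacreBLEU's default 13a tokenizer.
print(sacrebleu.corpus_bleu(sys, refs).score)

# What the 37.8 above corresponds to: output AND reference already normalized/tokenized
# externally, with sacreBLEU's own tokenization switched off.
print(sacrebleu.corpus_bleu(sys, refs, tokenize="none").score)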

MultiPath commented 4 years ago

@vince62s Hi, for EN-RO, we were comparing with previous works where people compute tokenized BLEU on the normalized datasets. Thanks

vince62s commented 4 years ago

Do you mind citing these "previous works"? Thanks.

MultiPath commented 4 years ago

@vince62s For example, the unsupervised MT results in MASS (https://arxiv.org/pdf/1905.02450.pdf), XLM (https://arxiv.org/pdf/1901.07291.pdf); many works about non-autoregressive MT (e.g. https://arxiv.org/pdf/1909.02480v3.pdf, https://arxiv.org/pdf/1904.09324.pdf, ...)

MultiPath commented 4 years ago

@vince62s We are just following Sennrich (2016a)'s scripts to post-process the output to get normalized/tokenized sentences: https://github.com/rsennrich/wmt16-scripts. I don't know how they evaluated in the official WMT16 competition; however, many papers followed their scripts to process the data and computed tokenized BLEU scores. If you want me to dig up all the related papers, it will take some time.

However, in my view, it is a fair way of comparison as mBART does not do preprocessing during training.

Thanks

vince62s commented 4 years ago

@MultiPath I don't want to sound too critical, because the mBART paper is great. However, the EN-RO statement is wrong. Just as an example, all your reported WMT17/WMT18/WMT19 scores for the other language pairs are far below the state of the art (Transformer-based systems and such); just look at the official WMT scores.

For EN<>RO I ran Google Translate: it gives RO>EN 43.3, EN>RO 32.7.

So how can your EN>RO score be legit? You would be state of the art by far on all the other language pairs as well.

When you say you just used Sennrich's scripts, nobody contests this; it's just that you don't have the "right" to normalize or touch the reference. You can do whatever you want with the training data and the output of your system, but in the end you need to score a detokenized output and submit it to sacreBLEU, or to a scorer that uses the NIST/13a/v14 tokenization, to be comparable, using the reference "as is".

@mjpost who is well aware about sacrebleu may confirm.

vince62s commented 4 years ago

In addition, for those who might be interested in Rico's original paper (https://www.aclweb.org/anthology/W16-2323.pdf), it clearly says:

"We found that the use of diacritics was inconsistent in the Romanian training (and development) data, so for Romanian→English we removed diacritics from the Romanian source side, obtaining improvements of 1.3–1.4 BLEU. Synthetic training data gives improvements of 4.1–5.1 BLEU. for English→Romanian, we found that the best single system outperformed the ensemble of the last 4 checkpoints on dev, and we thus submitted the best single system as primary system."

=> no diacritics removal in the EN > RO experiment. You can check the output of Rico here: http://matrix.statmt.org/matrix/output/1843?run_id=4303

cbaziotis commented 4 years ago

(Quoting @mjpost's earlier comment above about scoring the pre-trained model as an auto-encoder with fairseq-generate, and the surprisingly low probability assigned to the [en_XX] token.)

I think the fact that the probability of the langid token is low might be related to this. I don't know how the model was trained, but based on the implementation of multilingual_denoising on master, the langid token is not used as the BOS in the input to the decoder.

SunbowLiu commented 4 years ago

(Quoting @mjpost's evaluation script and the 37.8 result from above.)

Hi Matt,

Thank you for your script, which really helps me a lot. Following what you did, I can reproduce a BLEU score of 37.5, which is still 1 BLEU point lower than the paper and 0.3 lower than your result.

I have three questions about the issue:

  1. Should I also normalize the testing input (i.e., English text) with:

    | $REPLACE_UNICODE_PUNCT \
    | $NORM_PUNC -l $lang \
    | $REM_NON_PRINT_CHAR \
    | $TOKENIZER -no-escape -l $lang \
    > $(basename $file).tok

  2. Should I truecase the test sets?

  3. Have you successfully reproduced the reported 38.5 BLEU score? If so, could you share the key tricks?

Thanks, Xuebo Liu

mjpost commented 4 years ago

Hi @SunbowLiu, I don't know the answers to your questions and gave up on trying to exactly recreate their scores.

SunbowLiu commented 4 years ago

Hi @vince62s,

I totally agree with your concerns about the tokenization of the RO reference, and I also somewhat doubt the real fine-tuning performance.

But after fine-tuning the RO-EN model myself on mbart.cc with the WMT16 training data, I can successfully reproduce a sacreBLEU score of 37.5 with the command cat test.hyp | sacrebleu -t wmt16 -l ro-en. Note that test.hyp is detokenized text, which makes the evaluation exactly the same as the official WMT setup. The result indeed outperforms the previous SOTA, confirming the effectiveness of mBART.

sshleifer commented 4 years ago

~@SunbowLiu I am also trying to finetune with torch 1.5.1, apex, fairseq on master and getting this issue https://github.com/NVIDIA/apex/issues/161 . What versions are you running?~ fixed with https://github.com/NVIDIA/apex/issues/161#issuecomment-646888385

vince62s commented 4 years ago

@SunbowLiu My point on diacritics is about EN to RO (not the reverse). Of course you can beat some WMT16 results on RO to EN, since at that time it was the RNN era...

luofuli commented 4 years ago

(Quoting @SunbowLiu's comment above about reproducing a sacreBLEU score of 37.5 for RO-EN after fine-tuning.)

@SunbowLiu Do you mean that 37.5 is the sacreBLEU score for WMT16 RO-EN? How about the sacreBLEU score for WMT16 RO-EN with back-translation?

yongchanghao commented 4 years ago

(Quoting @mjpost's comment above about mBART using the target-language token as both BOS and EOS for the decoder.)

Hi Matt, I've tried feeding the target [en_XX] as the first input token of the decoder; however, this did not make any difference. The pre-trained mBART even gets extremely low probabilities (lower than a uniform distribution) under force-decoding. Do you have any ideas about this? Thank you.

JunjieHu commented 3 years ago

(Quoting @SunbowLiu's comment above about reproducing a sacreBLEU score of 37.5 for RO-EN after fine-tuning.)

Hi @SunbowLiu Could you share the fine-tuning script? Do you have results on fine-tuning mbart for EN-RO?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!