@mjpost Please use the tokenizer in the README. Our output is not BPEed, so you need to use the tokenizer to parse the data and then tokenize it.
Thanks for the response. The README says "set tokenizer here". I presume this means to apply the remove-diacritics.py and normalise-romanian.py scripts in that repo. I will also guess that I should run the Moses tokenizer with the -l ro flag. Doing so:
$ cat out.debpe | ~/code/wmt16-scripts/preprocess/normalise-romanian.py | ~/code/wmt16-scripts/preprocess/remove-diacritics.py | ~/code/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ro > out.tok
$ sacrebleu -t wmt16 -l en-ro --echo ref | ~/code/wmt16-scripts/preprocess/normalise-romanian.py | ~/code/wmt16-scripts/preprocess/remove-diacritics.py | ~/code/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ro > ref.tok
$ cat out.tok | sacrebleu -tok none -s none -b ref.tok
gives 37.1, which is at least much closer to what's reported in the README.
@mjpost Very close now. You are still missing three steps before normalization; see below:
lg=ro
$REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $NORMALIZE_ROMANIAN | $REMOVE_DIACRITICS | $TOKENIZER -no-escape -threads $N_THREADS -l $lg
@mjpost Please let me know if you can reproduce the number so that I can close this issue.
I didn't expect that this preprocessing would make such a difference.
Running your exact command gives me 37.8:
#!/bin/bash
set -eu
REPLACE_UNICODE_PUNCT=$HOME/code/mosesdecoder/scripts/tokenizer/replace-unicode-punctuation.perl
NORM_PUNC=$HOME/code/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$HOME/code/mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl
REMOVE_DIACRITICS=$HOME/code/wmt16-scripts/preprocess/remove-diacritics.py
NORMALIZE_ROMANIAN=$HOME/code/wmt16-scripts/preprocess/normalise-romanian.py
TOKENIZER=$HOME/code/mosesdecoder/scripts/tokenizer/tokenizer.perl
sys=$1
ref=$2
lang=ro
for file in $sys $ref; do
cat $file \
| $REPLACE_UNICODE_PUNCT \
| $NORM_PUNC -l $lang \
| $REM_NON_PRINT_CHAR \
| $NORMALIZE_ROMANIAN \
| $REMOVE_DIACRITICS \
| $TOKENIZER -no-escape -l $lang \
> $(basename $file).tok
done
cat $(basename $sys).tok | sacrebleu -tok none -s none -b $(basename $ref).tok
I run this as:
# ./eval-enro.sh out.clean.wmt19.ro.txt wmt19.en-ro.ro.txt
37.8
with the attached files (the first one being the output).
@mjpost This is very helpful! I am having trouble following what the final 37.8 solution ended up being. Is out.clean.wmt19.ro.txt generated from your original script or an intermediate result? Thanks!
@sshleifer I believe it was generated from the fairseq model (so cat fairseq.out | grep ^H | cut -f3 | spm_decode).
I never reached their reported score but it was close.
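For reference, here is a sketch of that post-processing step (the filenames and the sentencepiece model path are placeholders, not taken from the thread):
# Keep fairseq's hypothesis lines ("H-..."), take the text field, and undo the
# sentencepiece segmentation.
grep ^H fairseq.out | cut -f3 | spm_decode --model=sentence.bpe.model > out.debpe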
Thanks! Did you ever get to the bottom of the difference between cc25 and en_ro models?
No—we had thought it would be an easy comparison as a baseline in a project we're working on, but I couldn't figure it out after putting some time into it. They didn't respond to the point in my second paragraph above.
This seems not quite resolved, particularly: "the names of the parameters seem to change between the main model and fine-tuned one".
@yinhanliu or @MultiPath, can you share any insight on why the weights change between cc25 and en-ro?
There are details of the error I ran into and how to reproduce it in #1754.
This needs to be True to solve #1754: https://github.com/pytorch/fairseq/blob/18831f9f8353e7b7902f4d9a651463f50f40ce3f/fairseq/models/bart/model.py#L248
Sorry, I don't quite understand what you tried to do and what failed. You tried to run generation on a pre-trained model? The pre-trained model is a de-noising model; it will copy src to tgt. It never learned translation.
Yes, I understand it's just a de-noiser. But I should at least be able to run it. I gave an EN-DE example in #1754, but if I switch to EN-EN, it still fails.
How do I tell the model that args.layernorm_embedding is true? This isn't a command-line argument but appears to be an internal model-creation parameter. Why is this not stored in the model config itself?
(A similar source of confusion comes from having to pass the list of language codes, instead of just adding these to the model dictionary.)
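One possible workaround (an assumption on my part: it relies on the checkpoint storing its training args as a Namespace under the "args" key, as older fairseq checkpoints do) is to flip the flag directly in the saved checkpoint:
import torch

# Hypothetical sketch: set the stored training flag so that model construction
# at load time sees layernorm_embedding=True. The output filename is a placeholder.
ckpt = torch.load("cc25_pretrain/model.pt", map_location="cpu")
if ckpt.get("args") is not None:
    ckpt["args"].layernorm_embedding = True
torch.save(ckpt, "cc25_pretrain/model_layernorm.pt")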
Just passing the arg should be fine. The fine-tuning command in the README gives better info on how to use the pre-trained mBART. The generate command is designed for translation (fine-tuned) models only.
That is an argument for training. I am not trying to fine-tune the pretrained model (that works); instead, I am trying to use the pretrained model with fairseq-generate, which doesn't have a --layernorm-embedding flag.
The generate command is designed for translation model (fine-tuned) only.
This was my original question. In principle there is no reason that I should not be able to decode with the pretrained model. My original question was to ask what parameters changed between the pretrained and fine-tuned models (i.e., what prevents the pretrained model from being used in decoding)?
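A quick way to see exactly which parameter names differ would be to diff the two checkpoints (a sketch only; the fine-tuned checkpoint path is a placeholder):
import torch

# Load both checkpoints on CPU and compare their parameter names.
pretrained = torch.load("cc25_pretrain/model.pt", map_location="cpu")["model"]
finetuned = torch.load("mbart.cc25.ft.enro/model.pt", map_location="cpu")["model"]

print("only in pretrained:", sorted(set(pretrained) - set(finetuned)))
print("only in fine-tuned:", sorted(set(finetuned) - set(pretrained)))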
Here is an example where I am trying to use the pre-trained model (not fine-tuned) to find the auto-encoder score of an English input using fairseq-generate. I can get it to work with the "translation" task. However, I am not sure that I have the correct results.
First, I have to clean the cc25_pretrain model. After rereading the paper, it seems the extra parameters are likely those mentioned in the "Architecture" paragraph in Section 2 of the paper:
We also include an additional layer-normalization layer on top of both the encoder and the decoder, which we found stabilized training at FP16 precision.
It is likely that the fine-tuning simply drops these. They are easy to remove with the following script:
import torch
d = torch.load("cc25_pretrain/model.pt")
for extra in ["encoder.layernorm_embedding.bias", "decoder.layernorm_embedding.weight", "decoder.layernorm_embedding.bias"]:
    if extra in d["model"]:
        del d["model"][extra]
torch.save(d, "cc25_pretrain/model.pt")
Next, I manually add the language codes to the dictionary:
for code in [ar_AR] [cs_CZ] [de_DE] [en_XX] [es_XX] [et_EE] [fi_FI] [fr_XX] [gu_IN] [hi_IN] [it_IT] [ja_XX] [kk_KZ] [ko_KR] [lt_LT] [lv_LV] [my_MM] [ne_NP] [nl_XX] [ro_RO] [ru_RU] [si_LK] [tr_TR] [vi_VN] [zh_CN] "<mask>"; do
echo "$code 1" >> cc25_pretrain/dict.txt
done
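As a sanity check (a sketch only; fairseq prepends four special tokens and may additionally pad the vocabulary, so the two numbers need not match exactly), the dictionary size should now roughly line up with the checkpoint's embedding table:
import torch

# Rows in the embedding table of the (cleaned) checkpoint.
ckpt = torch.load("cc25_pretrain/model.pt", map_location="cpu")
embed_rows = ckpt["model"]["encoder.embed_tokens.weight"].shape[0]

# Entries in dict.txt plus fairseq's special tokens (<s>, <pad>, </s>, <unk>).
with open("cc25_pretrain/dict.txt") as f:
    dict_size = sum(1 for _ in f) + 4

print("embedding rows:", embed_rows, "dict entries + specials:", dict_size)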
Now, the model will work with the "translation" task with fairseq-generate.
Suppose I would like to use the model to find the auto-encoder score for an English sentence. It seems one has to append the language codes to the start and end, like this:
S-8 ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke . [en_XX]
T-8 [en_XX] ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
H-8 -29.304781542170865 [en_XX] ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
P-8 -45.5193 -36.3480 -17.6285 -33.7401 -18.9918 -41.1282 -45.5779 -44.3245 -14.6398 -14.6273 -33.6461 -22.1626 -12.6281
The probabilities here are very low. The vocabulary is quite large, but I would have expected the decoder prediction of the first token [en_XX] to be very high. Removing the language codes entirely produces much higher sentence-level scores:
S-8 ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
T-8 ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
H-8 -13.532117797144407 ▁Other s ▁have ▁dis miss ed ▁him ▁as ▁a ▁joke .
P-8 -30.6400 -2.8578 -18.5653 -10.2856 -16.6289 -16.3278 -17.9757 -5.5656 -3.2346 -20.1281 -9.9263 -10.2497
But it is hard to know what the right call is here.
Perhaps the language codes are used as the actual <bos> and <eos> tokens? In that case one would have to adapt the translation_from_pretrained_bart task. I have played around with this quite a bit (one has to take care, since that task does not permit auto-encoder scoring out of the box, but assumes you are using two distinct languages). I can do this work, but it would be very helpful to have some technical guidance here on exactly what the model expects. @yinhanliu, perhaps this would be an easy question for you to answer?
Looking at translation_from_pretrained_bart, it seems that adding the source LID token to the end of the source sentence is correct, but that what I need is to use the target-side language code as the decoder BOS:
eos = None
if append_source_id:
    src_dataset = AppendTokenDataset(src_dataset, src_dict.index('[{}]'.format(src)))
    if tgt_dataset is not None:
        tgt_dataset = AppendTokenDataset(tgt_dataset, tgt_dict.index('[{}]'.format(tgt)))
    eos = tgt_dict.index('[{}]'.format(tgt))
I presume that appending here causes the target-side LID token to serve as EOS, and that I have to set it to BOS, too, so that the decoder context is properly initialized.
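One way I might try that (purely a sketch, not verified here; it reuses the tgt_dataset, tgt_dict, and tgt variables from the snippet above) is to mirror the append with fairseq's PrependTokenDataset, so that the target LID also becomes the first decoder-side token:
from fairseq.data import AppendTokenDataset, PrependTokenDataset

# Hypothetical adaptation of the snippet above: besides appending the target
# LID (so it acts as EOS), also prepend it so that it plays the role of BOS.
if tgt_dataset is not None:
    tgt_dataset = AppendTokenDataset(tgt_dataset, tgt_dict.index('[{}]'.format(tgt)))
    tgt_dataset = PrependTokenDataset(tgt_dataset, tgt_dict.index('[{}]'.format(tgt)))
eos = tgt_dict.index('[{}]'.format(tgt))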
Hi @mjpost, I'm trying to reproduce the work you have done to find the auto-encoder score.
I deleted the layernorm embedding layers:
import torch
d = torch.load("bart/cc25_pretrain/mbart.cc25/model.pt")
for extra in ["encoder.layernorm_embedding.weight", "encoder.layernorm_embedding.bias", "decoder.layernorm_embedding.weight", "decoder.layernorm_embedding.bias"]:
    if extra in d["model"]:
        del d["model"][extra]
torch.save(d, "bart/cc25_pretrain/mbart.cc25/model_no_layernorm_embedding.pt")
Then I updated the dictionary:
for code in [ar_AR] [cs_CZ] [de_DE] [en_XX] [es_XX] [et_EE] [fi_FI] [fr_XX] [gu_IN] [hi_IN] [it_IT] [ja_XX] [kk_KZ] [ko_KR] [lt_LT] [lv_LV] [my_MM] [ne_NP] [nl_XX] [ro_RO] [ru_RU] [si_LK] [tr_TR] [vi_VN] [zh_CN] "<mask>"; do
echo "$code 1" >> cc25_pretrain/dict.txt
done
Then I pre-processed some English sentences:
fairseq-preprocess --trainpref sample_text --srcdict dict.txt --tgtdict dict.txt --destdir data-sample --source-lang source --target-lang target
Finally I translated using the pre-trained model:
fairseq-generate data-sample --path $model --task translation --gen-subset train -t target -s source --bpe 'sentencepiece' --sentencepiece-vocab ../sentence.bpe.model
However, I am getting very weird output, for example:
S-0 How are you.[en_XX]
T-0 [en_XX] How are you.
H-0 -0.1974826455116272 ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa ▁coa
D-0 -0.1974826455116272 coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa coa
P-0 -20.5911 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -19.0969
and:
S-1 What is your name
T-1 What is your name
H-1 -0.16524094343185425 ▁Home
D-1 -0.16524094343185425 Home
P-1 -0.0007 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -22.5516 -10.6590
Am I missing something?
@moussaKam I haven't had time to look into this deeply enough to solve it, so this may be incorrect. But I believe the problem is that MBART models use the target-language token both as BOS and EOS for the decoder. That is, the decoder is initialized with the language tag as BOS, and terminates when it generates that tag. This creates the following set of problems:
- You cannot actually set BOS. You can prefix decode to a token (i.e., force it to predict the target language token as the first target-side token), but not set it to BOS (a sketch follows below).
- Things are complicated because fairseq uses the values of -s and -t both as indices into the prepacked data and, optionally, to initialize BOS and EOS. But if you are trying to paraphrase, as your example suggests, you have to use a dummy language code for the source sentence, since you need en_XX for the target.
- I am also uncertain about using the unadapted model in this way. They train with an additional layernorm layer (which you've deleted above). But without adapting the model, it's not clear what effect removing this will have, since the pretraining made use of it.
In short, I think you have to modify the code to be able to use the model in this way, and it's not certain that, once the technical barriers are cleared, the model will do what you are hoping. But of course, that is what I was trying to test empirically (and the reason for this issue).
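To illustrate the "prefix decode" option from the first bullet, here is a sketch (reusing the placeholder names from the fairseq-generate command earlier in the thread, and assuming the binarized target side starts every sentence with [en_XX]):
# --prefix-size 1 forces the decoder to emit the first target token ([en_XX]);
# it does not make that token the BOS.
fairseq-generate data-sample --path $model --task translation \
  --gen-subset train -s source -t target \
  --bpe 'sentencepiece' --sentencepiece-vocab ../sentence.bpe.model \
  --prefix-size 1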
@mjpost @yinhanliu @sshleifer I understand you are trying to replicate the score from the paper, but the paper is not comparable to Sennrich's WMT16 results for EN-RO. 37.8 is computed on tokenized output and reference, and on top of this there is some normalization / preprocessing of the reference, which is not what they do for WMT.
The mBART paper says for RO "We apply Moses tokenization and special normalization for Romanian texts" following Sennrich 2016a, but this seems wrong to me.
Bottom line @mjpost: your 26.9 is the WMT-comparable score, unless you prefer to normalize / remove diacritics in your output, but not in the reference.
EDIT: In the introduction of the paper it says: "These results further improve with backtranslation (BT), setting a new state-of-the-art on WMT16 English-Romanian and the FloRes test sets." Later at the end of section 3.2: "Moreover, combining BT leads to additional gains, resulting in a new state-of-the-art for Ro-En translation."
Ro-En, if really scored with detokenized sacreBLEU, might be SOTA; En-Ro, I doubt it (cf. my comments above).
@vince62s Hi, for EN-RO, we were comparing with previous works where people compute tokenized BLEU on the normalized datasets. Thanks
Do you mind quoting these "previous works"? Thanks.
@vince62s For example, the unsupervised MT results in MASS (https://arxiv.org/pdf/1905.02450.pdf), XLM (https://arxiv.org/pdf/1901.07291.pdf); many works about non-autoregressive MT (e.g. https://arxiv.org/pdf/1909.02480v3.pdf, https://arxiv.org/pdf/1904.09324.pdf, ...)
@vince62s We are just following (Sennrich 2016a)'s scripts (https://github.com/rsennrich/wmt16-scripts) to post-process the output into normalized/tokenized sentences. I don't know how the evaluation was done in the official WMT16 competition; however, many papers followed their scripts to process the data and computed tokenized BLEU scores. If you want me to dig up all the related papers, it will take some time.
However, in my view, it is a fair comparison, as mBART does not do preprocessing during training.
Thanks
@MultiPath I don't want to sound too critical, because the mBART paper is great. However, the EN-RO claim is wrong. Just one example: all of your reported WMT17/WMT18/WMT19 scores for other language pairs are far below the state of the art (transformer-based systems and the like); just look at the official WMT scores.
For EN<>RO I ran Google Translate: it gives RO>EN 43.3 and EN>RO 32.7.
So how can your EN>RO number be legit? Otherwise you would also be far above the state of the art on all the other language pairs.
When you say you just used Sennrich's scripts, nobody contests that; it's just that you don't have the "right" to normalize or otherwise touch the reference. You can do whatever you want with the training data and with the output of your system, but in the end you need to score a detokenized output against the reference "as is", using sacreBLEU or a scorer that applies the NIST/13a/14 tokenization, to be comparable.
@mjpost, who is well aware of sacreBLEU, may confirm.
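For concreteness, the WMT-comparable scoring being described would look like this (a sketch; out.detok.ro is a placeholder filename for the detokenized system output):
# Detokenized output, untouched reference, sacreBLEU's default (13a) tokenization.
cat out.detok.ro | sacrebleu -t wmt16 -l en-ro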
In addition, for those who might be interested in Rico's original paper (https://www.aclweb.org/anthology/W16-2323.pdf), it says clearly:
"We found that the use of diacritics was inconsistent in the Romanian training (and development) data, so for Romanian→English we removed diacritics from the Romanian source side, obtaining improvements of 1.3–1.4 BLEU. Synthetic training data gives improvements of 4.1–5.1 BLEU. for English→Romanian, we found that the best single system outperformed the ensemble of the last 4 checkpoints on dev, and we thus submitted the best single system as primary system."
=> no diacritics removal in the EN > RO experiment. You can check the output of Rico here: http://matrix.statmt.org/matrix/output/1843?run_id=4303
I think the fact that the probability of the langid token is low in the auto-encoder scoring above might be related to the following: I don't know how the model was trained, but based on the implementation of multilingual_denoising on master, the langid token is not used as the BOS in the input to the decoder.
Hi Matt,
Thank you for your script, which really helps me a lot. Following what you did, I can reproduce a BLEU score of 37.5, which is still 1 BLEU point lower than the paper and 0.3 lower than your result.
I have three questions about the issue:
Should I also normalize the testing input (i.e., English text) with:
| $REPLACE_UNICODE_PUNCT \
| $NORM_PUNC -l $lang \
| $REM_NON_PRINT_CHAR \
| $TOKENIZER -no-escape -l $lang \
> $(basename $file).tok
Should I truecase the test sets?
Have you successfully reproduced the reported 38.5 BLEU score? If so, could you share the key tricks?
Thanks, Xuebo Liu
Hi @SunbowLiu, I don't know the answers to your questions and gave up on trying to exactly recreate their scores.
Hi @vince62s,
I totally agree with your concerns about the tokenization of the RO reference, and I also somewhat doubt the real fine-tuning performance.
But after fine-tuning the RO-EN model myself over mbart.cc with the WMT16 training data, I can successfully reproduce a sacreBLEU score of 37.5 with the command cat test.hyp | sacrebleu -t wmt16 -l ro-en. Note that test.hyp is detokenized text, which makes the evaluation exactly the same as the official WMT evaluation. The result indeed outperforms the previous SOTA, confirming the effectiveness of mBART.
~@SunbowLiu I am also trying to finetune with torch 1.5.1, apex, fairseq on master and getting this issue https://github.com/NVIDIA/apex/issues/161 . What versions are you running?~ fixed with https://github.com/NVIDIA/apex/issues/161#issuecomment-646888385
@SunbowLiu My point on diacritics concerns EN to RO (not the reverse). Of course you can beat some WMT16 results on RO to EN, since the systems at that time were still RNN-based.
@SunbowLiu Do you mean that 37.5 is the sacreBLEU score for WMT16 RO-EN? What about the sacreBLEU score for WMT16 RO-EN with back-translation?
Hi Matt, I've tried feeding the target [en_XX] as the first input token of the decoder; however, this did not make any difference. The pre-trained mBART even gives extremely low probabilities (lower than a uniform distribution) under force-decoding. Do you have any ideas about this? Thank you.
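Here is a sketch of one way to set up that force-decoding with fairseq-generate (reusing the placeholder names from the earlier command; --score-reference scores the provided target instead of running beam search):
fairseq-generate data-sample --path $model --task translation \
  --gen-subset train -s source -t target --score-reference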
Hi @SunbowLiu, could you share the fine-tuning script? Do you have results for fine-tuning mBART on EN-RO?
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
❓ Questions and Help
Thanks for releasing the mBART models! However, we are unable to reproduce the EN-RO fine-tuned BLEU scores reported in the paper. We get a BLEU score of 26.9, using sacreBLEU's default tokenization, 13a. This is well below the 38.5 reported in the README and even below scores reported for WMT16. Here is a complete script to reproduce this; is there anything obvious we are doing wrong?
We have also tried to work with scoring the main, pretrained-only model, and were surprised to find that the names of the parameters seem to change between the main model and the fine-tuned one. Perhaps documenting this is beyond the scope of your intentions in releasing the model, but it is a bit confusing when working with these models.
Code
Here is the code we run:
What's your environment?
How you installed fairseq (pip, source): source, via pip install --editable . (within a conda env)