facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Why is there a BLEU difference between evaluate and valid in WMT14 translation with Transformer? #4477

Open tjshu opened 2 years ago

tjshu commented 2 years ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

As in the title: valid BLEU reaches 27.7, but evaluate only reaches 25.

Code

Download and prepare the data

cd examples/translation/
bash prepare-wmt14en2de.sh --icml17

cd ../..

Binarize the dataset

TEXT=examples/translation/wmt17_en_de
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

Train the model

PYTHONIOENCODING=utf-8 fairseq-train \
    data-bin/wmt17_en_de \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --max-tokens-valid 4096 \
    --update-freq 1 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --save-dir checkpoints/wmt17_en_de/transformer/ckpt \
    --log-format json \
    --keep-last-epochs 5 \
    --max-epoch 30 \
    --fp16 \
    | tee checkpoints/wmt17_en_de/transformer/train.log

Evaluate

PYTHONIOENCODING=utf-8 fairseq-generate data-bin/wmt17_en_de \
    --path checkpoints/wmt17_en_de/transformer/ckpt/checkpoint_avg_last_5.pt \
    --batch-size 128 --beam 5 --remove-bpe \
    | tee checkpoints/wmt17_en_de/transformer/evaluate/evaluate.log

What have you tried?

I have tried adding --scoring sacrebleu, but BLEU only went up a little.

What's your environment?

gmryu commented 2 years ago

I guess you have to give the same arguments at inference time, e.g. --max-len-a 1.2 --max-len-b 10 (they are given during training, but not to generate).

In addition, could it be because you used a slightly different checkpoint? checkpoints/wmt17_en_de/transformer/ckpt/checkpoint_avg_last_5.pt looks like an averaged checkpoint, not exactly the one that was validated.
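
For example, that would mean reusing the generation command posted above with only the two length flags added (a suggestion to try, not something verified in this thread):

PYTHONIOENCODING=utf-8 fairseq-generate data-bin/wmt17_en_de \
    --path checkpoints/wmt17_en_de/transformer/ckpt/checkpoint_avg_last_5.pt \
    --batch-size 128 --beam 5 --remove-bpe \
    --max-len-a 1.2 --max-len-b 10 \
    | tee checkpoints/wmt17_en_de/transformer/evaluate/evaluate.log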

tjshu commented 2 years ago

I guess you have to give the same arguments at inference time, e.g. --max-len-a 1.2 --max-len-b 10 (they are given during training, but not to generate).

In addition, could it be because you used a slightly different checkpoint? checkpoints/wmt17_en_de/transformer/ckpt/checkpoint_avg_last_5.pt looks like an averaged checkpoint, not exactly the one that was validated.

Thanks. After trying it, adding --max-len-a 1.2 --max-len-b 10 did not change anything. Second, I also tried the best and the last checkpoints, but got no change either.

gmryu commented 2 years ago

Would you mind running a test for me? I think I found the reason. fairseq-train's --eval-bleu uses sacrebleu; you can find it in fairseq/tasks/translation.py, in def _inference_with_bleu, around lines 463~498 (the end of the .py file). Fortunately, the comment starting around line 470 says:

The default unknown string in fairseq is <unk>, but this is tokenized by sacrebleu as < unk >, inflating BLEU scores. Instead, we use a somewhat more verbose alternative that is unlikely to appear in the real reference, but doesn't get split into multiple tokens.

As that alternative, fairseq replaces <unk> with UNKNOWNTOKENINHYP.
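
For reference, the relevant part of _inference_with_bleu looks roughly like the sketch below (paraphrased and abbreviated from that file; details may differ between fairseq versions):

# Paraphrased from fairseq/tasks/translation.py, _inference_with_bleu
# (abbreviated; check your installed fairseq version for the exact code).
def decode(toks, escape_unk=False):
    s = self.tgt_dict.string(
        toks.int().cpu(),
        self.cfg.eval_bleu_remove_bpe,
        # <unk> would be split by sacrebleu's tokenizer into "< unk >",
        # so a verbose placeholder is substituted instead.
        unk_string=("UNKNOWNTOKENINREF" if escape_unk else "UNKNOWNTOKENINHYP"),
    )
    if self.tokenizer:
        s = self.tokenizer.decode(s)
    return s

# Hypotheses come from inference_step (the same generation used by
# fairseq-generate); references are decoded with escape_unk=True so that
# <unk> in the reference can never match <unk> in the hypothesis.
hyps.append(decode(gen_out[i][0]["tokens"]))
refs.append(decode(utils.strip_pad(sample["target"][i], self.tgt_dict.pad()),
                   escape_unk=True))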

On the other hand, fairseq-generate --scoring sacrebleu also uses sacrebleu, but there is no such replacement code in either fairseq_cli/generate.py or fairseq/scoring/bleu.py.

So I want you to run two tests for me:

  1. There is a --replace-unk argument; test it with generate.py. This may not work, because the code flow is different.
  2. Search for dict.string( in fairseq_cli/generate.py and you will find two places. Give the src_dict one unk_string="UNKNOWNTOKENINHYP" and the tgt_dict one unk_string="UNKNOWNTOKENINREF", then run the generation again (a sketch of this change follows the list).
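
As an illustration of test 2, the two calls would end up looking something like the hypothetical sketch below (the surrounding code in fairseq_cli/generate.py is abbreviated here, and argument names may differ between fairseq versions):

# Hypothetical sketch of the suggested edit to fairseq_cli/generate.py
# (abbreviated; check your local copy for the exact call sites).

# Source side: use the verbose placeholder instead of the literal <unk>.
src_str = src_dict.string(
    src_tokens,
    cfg.common_eval.post_process,
    unk_string="UNKNOWNTOKENINHYP",   # added
)

# Reference side: same idea, with the reference placeholder.
target_str = tgt_dict.string(
    target_tokens,
    cfg.common_eval.post_process,
    escape_unk=True,
    unk_string="UNKNOWNTOKENINREF",   # added
)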

Please let me know if you try them and whether they work.

--

By the way, about changing the checkpoint: was there no change at all, or do you mean the numbers still do not match? It is very unlikely for the score to be the same across different checkpoints.

tjshu commented 2 years ago

Would you mind running a test for me? I think I found the reason. fairseq-train's --eval-bleu uses sacrebleu; you can find it in fairseq/tasks/translation.py, in def _inference_with_bleu, around lines 463~498 (the end of the .py file). Fortunately, the comment starting around line 470 says:

The default unknown string in fairseq is <unk>, but this is tokenized by sacrebleu as < unk >, inflating BLEU scores. Instead, we use a somewhat more verbose alternative that is unlikely to appear in the real reference, but doesn't get split into multiple tokens.

As that alternative, fairseq replaces <unk> with UNKNOWNTOKENINHYP.

On the other hand, fairseq-generate --scoring sacrebleu also uses sacrebleu, but there is no such replacement code in either fairseq_cli/generate.py or fairseq/scoring/bleu.py.

So I want you to run two tests for me:

  1. There is a --replace-unk argument; test it with generate.py. This may not work, because the code flow is different.
  2. Search for dict.string( in fairseq_cli/generate.py and you will find two places. Give the src_dict one unk_string="UNKNOWNTOKENINHYP" and the tgt_dict one unk_string="UNKNOWNTOKENINREF", then run the generation again.

Please let me know if you try them and whether they work.

--

By the way, about changing the checkpoint: was there no change at all, or do you mean the numbers still do not match? It is very unlikely for the score to be the same across different checkpoints.

In total I added --replace-unk --dataset-impl=raw --source-lang en --target-lang de to fairseq-generate and got:

AssertionError: --replace-unk requires a raw text dataset (--dataset-impl=raw)
FileNotFoundError: Dataset not found: test (data-bin/wmt14_en_de)
Exception: Could not infer language pair, please provide it explicitly (--source-lang en --target-lang de)
FileNotFoundError: Dataset not found: test (data-bin/wmt14_en_de)

The contents of data-bin/wmt14_en_de are shown in the attached screenshot.

And... I don't know why the test files are not test.en or test.de. Should I prepare the data again with some command?

Sorry, there is a little change, but it is still far from 27.

Thank you.

gmryu commented 2 years ago

Leave --replace-unk alone; it is not used in the documentation anyway. Doing only test 2 is fine.

tjshu commented 2 years ago

Leave --replace-unk alone; it is not used in the documentation anyway. Doing only test 2 is fine.

Giving only --replace-unk results in this error:

AssertionError: --replace-unk requires a raw text dataset (--dataset-impl=raw)

gmryu commented 2 years ago

@tjshu What is the situation now? Have you tried the following?

Search for dict.string( in fairseq_cli/generate.py and you will find two places. Give the src_dict one unk_string="UNKNOWNTOKENINHYP" and the tgt_dict one unk_string="UNKNOWNTOKENINREF", then run the generation again.

--replace-unk is not needed, since it depends on how the data was prepared with fairseq-preprocess. The 2nd fix I mentioned does not need it either.

tjshu commented 2 years ago

@tjshu What is the situation now? Have you tried the following?

Search for dict.string( in fairseq_cli/generate.py and you will find two places. Give the src_dict one unk_string="UNKNOWNTOKENINHYP" and the tgt_dict one unk_string="UNKNOWNTOKENINREF", then run the generation again.

--replace-unk is not needed, since it depends on how the data was prepared with fairseq-preprocess. The 2nd fix I mentioned does not need it either.

I have tried the 2nd one, but there was no change at all.

I guess my valid BLEU of 27.XX is itself too low. In evaluation it may come out a little lower than the valid BLEU of 27.XX, so should I train a model that gets better than 28 or 28.5? Is that the key problem here? I trained a second time, got valid BLEU 27.39, and then evaluate BLEU 23.87. Thank you.

gmryu commented 2 years ago

Thanks for telling me. I may look into this difference in BLEU score, though at the fastest pace the debugging will be at the end of this week.

It is up to you to decide how much training is enough; a higher score is always appealing. Still, if the evaluation differs from the training result without a reason, that is a must-solve problem.

tjshu commented 2 years ago

Thanks for telling me. I may look into this difference in BLEU score, though at the fastest pace the debugging will be at the end of this week.

It is up to you to decide how much training is enough; a higher score is always appealing. Still, if the evaluation differs from the training result without a reason, that is a must-solve problem.

Can I ask how big the gap between valid and evaluate BLEU is when you train? Because I have never gotten it to work, I don't know what is normal. The reason I guess this is the key problem is that my classmate told me my valid score is too low. Thank you.

gmryu commented 2 years ago

I do not get your question.

If --eval-bleu is not given, the valid loss comes from the criterion (label_smoothed_cross_entropy in this case); the cross entropy is computed by feeding the target sentences as decoder input to the model, so no generation happens. If --eval-bleu is given, then validation also uses inference_step to generate hypotheses for computing sacrebleu.

fairseq-generate uses the same inference to generate hypotheses as well.

So there is no difference in how the model generates. The only difference is how the score is calculated.
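
Roughly, the translation task's valid_step does something like the sketch below when --eval-bleu is set (paraphrased and abbreviated; check your installed fairseq version for the exact code):

# Paraphrased sketch of fairseq/tasks/translation.py, valid_step (abbreviated).
def valid_step(self, sample, model, criterion):
    # Teacher-forced pass through the criterion (label_smoothed_cross_entropy
    # here): this produces the valid loss and involves no generation.
    loss, sample_size, logging_output = super().valid_step(sample, model, criterion)
    if self.cfg.eval_bleu:
        # With --eval-bleu, hypotheses are generated via inference_step
        # (the same generation path fairseq-generate uses) and scored with
        # sacrebleu inside _inference_with_bleu.
        bleu = self._inference_with_bleu(self.sequence_generator, sample, model)
        logging_output["_bleu_sys_len"] = bleu.sys_len
        logging_output["_bleu_ref_len"] = bleu.ref_len
    return loss, sample_size, logging_output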

tjshu commented 2 years ago

I do not get your question.

If --eval-bleu is not given, the valid loss comes from the criterion (label_smoothed_cross_entropy in this case); the cross entropy is computed by feeding the target sentences as decoder input to the model, so no generation happens. If --eval-bleu is given, then validation also uses inference_step to generate hypotheses for computing sacrebleu.

fairseq-generate uses the same inference to generate hypotheses as well.

So there is no difference in how the model generates. The only difference is how the score is calculated.

Sorry. Can I ask how big the BLEU gap between valid and evaluate is when you train? Is it small, or is there none?