tjshu opened this issue 2 years ago
I guess you have to give the same arguments at inference time, e.g. --max-len-a 1.2 --max-len-b 10 (they are given in train, but not in generate).
In addition, was it because you used a slightly different checkpoint? checkpoints/wmt17_en_de/transformer/ckpt/checkpoint_avg_last_5.pt looks like an averaged checkpoint, not the exact one that was validated.
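(For context: an averaged checkpoint like checkpoint_avg_last_5.pt is normally produced with fairseq's scripts/average_checkpoints.py. Conceptually it is just an element-wise mean of the weights of the last N epoch checkpoints, roughly like this sketch; the paths here are illustrative:)

import torch

# Illustrative: average the model weights of the last 5 epoch checkpoints.
paths = [f"checkpoints/wmt17_en_de/transformer/ckpt/checkpoint{e}.pt"
         for e in range(26, 31)]
avg = None
for p in paths:
    state = torch.load(p, map_location="cpu")["model"]  # fairseq stores weights under "model"
    if avg is None:
        avg = {k: v.float().clone() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg[k] += v.float()
for k in avg:
    avg[k] /= len(paths)  # element-wise mean of each parameter tensor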
Thanks. After trying it, adding --max-len-a 1.2 --max-len-b 10 did not change anything. Second, I also tried the best and the last checkpoints, but got no change either.
Would you mind doing a test for me? I guess I found the reason.
fairseq-train's --eval-bleu uses sacrebleu; you can find it in fairseq/tasks/translation.py, in def _inference_with_bleu, around lines 463~498 (the end of the file).
Fortunately, the comment starting around line 470 says:
"The default unknown string in fairseq is <unk>, but this is tokenized by sacrebleu as < unk >, inflating BLEU scores. Instead, we use a somewhat more verbose alternative that is unlikely to appear in the real reference, but doesn't get split into multiple tokens."
So, as an alternative, fairseq replaces <unk> with UNKNOWNTOKENINHYP.
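In condensed form, the decode helper inside _inference_with_bleu does roughly this (paraphrased; the exact code differs slightly between fairseq versions):

def decode(toks, escape_unk=False):
    # Replace <unk> with a string that sacrebleu will not split into
    # multiple tokens, so it cannot accidentally match and inflate BLEU.
    s = self.tgt_dict.string(
        toks.int().cpu(),
        self.cfg.eval_bleu_remove_bpe,
        unk_string=("UNKNOWNTOKENINREF" if escape_unk else "UNKNOWNTOKENINHYP"),
    )
    if self.tokenizer:
        s = self.tokenizer.decode(s)
    return s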
On the other hand, fairseq-generate --scoring sacrebleu also means sacrebleu, but there is no such replacement code in either fairseq_cli/generate.py or fairseq/scoring/bleu.py.
So I want you to do 2 tests for me:
- there is a --replace-unk argument; test it with generate.py. This may not work, as the code flow has changed.
- search for dict.string( in fairseq_cli/generate.py and you will find 2 places. Give the src_dict one unk_string="UNKNOWNTOKENINHYP" and the tgt_dict one unk_string="UNKNOWNTOKENINREF", then run the generation again (see the sketch after this list).
Please let me know if you try it and whether it works or not.
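A hypothetical sketch of that 2nd test (the surrounding argument lists differ across fairseq versions; the only change is the added unk_string keyword, which Dictionary.string accepts):

# In fairseq_cli/generate.py -- sketch only, call sites vary by version.

# 1) the src_dict call that detokenizes the source:
src_str = src_dict.string(
    src_tokens,
    cfg.common_eval.post_process,
    unk_string="UNKNOWNTOKENINHYP",  # added
)

# 2) the tgt_dict call that detokenizes the reference:
target_str = tgt_dict.string(
    target_tokens,
    cfg.common_eval.post_process,
    escape_unk=True,
    unk_string="UNKNOWNTOKENINREF",  # added; overrides the escaped <unk>
)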
--
By the way, when you changed your checkpoint, was there no change at all, or do you mean the number still does not fit? It is very unlikely for the score to be the same across different checkpoints.
In total, I added --replace-unk --dataset-impl=raw --source-lang en --target-lang de to fairseq-generate.
AssertionError: --replace-unk requires a raw text dataset (--dataset-impl=raw)
FileNotFoundError: Dataset not found: test (data-bin/wmt14_en_de)
Exception: Could not infer language pair, please provide it explicitly (--source-lang en --target-lang de)
FileNotFoundError: Dataset not found: test (data-bin/wmt14_en_de)
The contents of data-bin/wmt14_en_de are below, and... I don't know why test is not test.en or test.de. Should I prepare the data again with some command?
Sorry, there is a little change, but it is far from 27. Thanks.
Leave that --replace-unk alone; it is not meant to be used in the documented workflow anyway. Doing the 2nd one is okay.
Giving only --replace-unk produces this error:
AssertionError: --replace-unk requires a raw text dataset (--dataset-impl=raw)
@tjshu What is the situation now? Have you tried the following?
Search for dict.string( in fairseq_cli/generate.py and you will find 2 places. Give the src_dict one unk_string="UNKNOWNTOKENINHYP" and the tgt_dict one unk_string="UNKNOWNTOKENINREF", then run the generation again.
--replace-unk is not needed, as it is something asked for in fairseq-preprocess. The 2nd fix I mentioned does not need it either.
I have tried the 2nd, but there was no change.
I guess my valid BLEU of 27.XX is just too low. In evaluation it may come out a little lower than the valid BLEU of 27.XX, so should I train a model to better than 28 or 28.5? Is that the key problem here? In my 2nd run, valid reached 27.39 but the evaluation BLEU was 23.87. Thank you.
Thanks for telling me; I may look into this difference in BLEU score, though at the fastest pace the debugging will happen at the end of this week.
It is up to you to decide how much training is best; a higher score is always appealing. That said, if the evaluation differs from the training result without a reason, that is a must-solve problem.
Can I ask how different valid and evaluate were when you trained? Because I have never succeeded, I don't know what is normal. The reason I guess this is the key problem is that my classmate told me my valid score is too low. Thank you.
I do not get your question.
If no --eval-bleu is given, the valid loss comes from the criterion (label_smoothed_cross_entropy this time); the cross entropy is calculated by giving the target sentences as decoder input to the model, so no generation happens.
If --eval-bleu is given, then validation uses inference_step to create hypotheses for calculating sacrebleu. fairseq-generate uses the same inference to create hypotheses as well.
So there is no difference in model generation; the only difference is the score calculation.
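Schematically (illustrative pseudo-code, not literal fairseq; hyp_strings and ref_strings stand in for the decoded outputs):

# (a) without --eval-bleu: valid loss via teacher forcing, no generation.
net_output = model(**sample["net_input"])  # decoder is fed the gold prefix
lprobs = model.get_normalized_probs(net_output, log_probs=True)
loss, nll_loss = label_smoothed_nll_loss(lprobs, sample["target"], epsilon=0.1)

# (b) with --eval-bleu (and in fairseq-generate): real beam search.
hyps = task.inference_step(generator, [model], sample)   # same generation path
bleu = sacrebleu.corpus_bleu(hyp_strings, [ref_strings]) # only scoring differs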
Sorry, can I ask how big the BLEU gap between valid and evaluate is when you train? A little, or none?
❓ Questions and Help
What is your question?
As in the title: valid goes up to 27.7, but evaluate only reaches 25.
Code
Download and prepare the data
cd examples/translation/
bash prepare-wmt14en2de.sh --icml17
cd ../..
Binarize the dataset
TEXT=examples/translation/wmt17_en_de
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20
Train the model
PYTHONIOENCODING=utf-8 fairseq-train \
    data-bin/wmt17_en_de \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --max-tokens-valid 4096 \
    --update-freq 1 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --save-dir checkpoints/wmt17_en_de/transformer/ckpt \
    --log-format json \
    --keep-last-epochs 5 \
    --max-epoch 30 \
    --fp16 \
    | tee checkpoints/wmt17_en_de/transformer/train.log
Evaluate
PYTHONIOENCODING=utf-8 fairseq-generate data-bin/wmt17_en_de \
    --path checkpoints/wmt17_en_de/transformer/ckpt/checkpoint_avg_last_5.pt \
    --batch-size 128 --beam 5 --remove-bpe \
    | tee checkpoints/wmt17_en_de/transformer/evaluate/evaluate.log
What have you tried?
I tried adding --scoring sacrebleu, but BLEU only went up a little.
What's your environment?
How you installed fairseq (pip, source): source