Closed: jzhoubu closed this issue 5 years ago.
All numbers reported in our paper make use of SacreBLEU, which you can install using `pip3 install sacrebleu`. You also need a copy of Moses.
I've made a couple of handy scripts for computing BLEU in the `scripts` folder. If, for example, you are trying to compute BLEU for EN-FR on WMT, then issue this command:
```
MOSESDECODER="<PATH_TO_MOSES>" scripts/sacrebleu-wrapper.sh "<OUTPUT_FILENAME>" en fr wmt14/full
```
Obviously you need to fill in the correct values for `<PATH_TO_MOSES>` and `<OUTPUT_FILENAME>`.
Here is the equivalent command for using `multi-bleu.perl` instead, which a lot of people use to report their numbers but shouldn't (it even states this as a warning when you run the script):
```
MOSESDECODER="<PATH_TO_MOSES>" scripts/multi-bleu-wrapper.sh "<OUTPUT_FILENAME>" "<PROJECT_PATH>/data/wmt_enfr/test.tok.fr"
```
Once again, make sure `<PROJECT_PATH>/data/wmt_enfr/test.tok.fr` points to the appropriate test file to compute the BLEU score against.
I hope that helps!
Thanks for replying. I tried to compute the BLEU score by running:

```
MOSESDECODER="../mosesdecoder" scripts/sacrebleu-wrapper.sh "/tmp/synst/output/translated_100000.txt" en de wmt14/full
```

and got the result shown below:
```
Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de
BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.intl+version.1.4.1 = 1.6 15.8/2.6/0.7/0.2 (BP = 1.000 ratio = 3.265 hyp_len = 211193 ref_len = 64676)
```
`translated_100000.txt` is generated by following the translation example from the README. `--checkpoint-interval` is set to 1200, and I have 4 checkpoints in total under `/tmp/synst/checkpoint`. I am wondering whether I made mistakes during evaluation or my model hasn't converged?
Can you provide the exact command line you used for each of the steps? Maybe I can spot an error. For example, I would expect there to be 5 checkpoints in `/tmp/synst/checkpoints` at the end of training.
Have you tried evaluating the perplexity? If so, what was it?
Have you looked at the translations in `translated_100000.txt`? If so, do they look reasonable when you compare them to the ground-truth reference? If not, what kinds of issues do they have?
@dojoteef
1. For generation, I removed `--length-basis input_lens --order-output` from the README command. First the preprocessing:

```
LANG=en_US.UTF-8 LC_ALL= CLASSPATH=./stanford-corenlp-full-2018-10-05/* python main.py \
    --dataset wmt_en_de_parsed --span 6 -d raw/wmt -p preprocessed/wmt -v pass
```

then the translation:
```
CUDA_VISIBLE_DEVICES=0 python main.py --dataset wmt_en_de_parsed --span 6 \
    --model parse_transformer -d raw/wmt -p preprocessed/wmt \
    --batch-size 1 --batch-method example --split test -v \
    --restore /tmp/synst/checkpoints/checkpoint.pt \
    --average-checkpoints 5 translate \
    --max-decode-length 50
```
2. I also evaluated the perplexity by executing:
```
python main.py -b 5000 --dataset wmt_en_de_parsed --span 6 \
    --model parse_transformer -d raw/wmt -p preprocessed/wmt \
    --split valid --disable-cuda -v evaluate
```
The result is:

```
Running torch 1.1.0
Examples=3000
Vocab Size=37686
Input Length=(min=1, avg=25, max=120)
Target Length=(min=1, avg=27, max=132)
Constituent Spans=(min=1.00, avg=1.90, max=6.00)
Validate #0 nll=26.17(26.36): 58batch [03:19, 3.66batch/s]
```
3. Here are some samples I copied from `translated_100000.txt` and `translated_100000.txt.detok`.
Thanks, that helps! There are two issues with the generation:

1. You need to add `--order-output`, otherwise it will potentially output the translations in a random order, which obviously won't work for calculating BLEU.
2. You should remove `-v` from the command I used for generation in the README (I need to remove it there). Basically, it causes the predicted chunk sequences to be output along with the translations for debugging purposes.

Try fixing those two things and let me know if it works!
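To illustrate why `--order-output` matters, here is a toy sketch (made-up sentences and indices, not the repo's actual bookkeeping):

```python
# Made-up sentences and indices: batched decoding can finish out of
# source order, so each hypothesis is tagged with its source index.
decoded = [
    (2, "Dritter Satz ."),
    (0, "Erster Satz ."),
    (1, "Zweiter Satz ."),
]

# What --order-output effectively guarantees: hypotheses restored to
# source order before being written, so line i aligns with reference i.
ordered = [text for _, text in sorted(decoded)]
print(ordered)
```

Without that reordering, each hypothesis line is scored against the wrong reference line, which can tank BLEU even when the translations themselves are fine.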
After fixing the two places you pointed out, it works! Below is the result:

```
BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.intl+version.1.4.1 = 18.5 52.3/24.7/13.3/7.5 (BP = 0.976 ratio = 0.976 hyp_len = 63130 ref_len = 64676)
```
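As a side note on reading that line: the four numbers after the score are the 1- to 4-gram precisions and `BP` is the brevity penalty. A stdlib-only sketch of how they combine into the headline BLEU (the standard formula, not sacrebleu's internals):

```python
import math

# 1- to 4-gram precisions (in percent) and brevity penalty, taken
# from the signature above.
precisions = [52.3, 24.7, 13.3, 7.5]
brevity_penalty = 0.976

# BLEU = BP * geometric mean of the n-gram precisions.
log_mean = sum(math.log(p / 100) for p in precisions) / len(precisions)
bleu = brevity_penalty * math.exp(log_mean) * 100
print(round(bleu, 1))  # 18.5
```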
The reason you are getting such a low BLEU score is likely because you removed `--length-basis input_lens`. That means it will stop decoding after 50 tokens, rather than 50 tokens plus the length of the source sentence (which is what the original Transformer paper does). Try adding that back in to see if you get a higher BLEU score.
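A sketch of the decode-length semantics just described (an illustrative helper, not the repo's actual implementation; the parameter names mirror the command-line flags):

```python
# Illustrative helper mirroring the flags above; not the repo's
# actual implementation.
def max_target_length(source_tokens, max_decode_length=50, length_basis=None):
    # With --length-basis input_lens, the decode budget is the source
    # length plus --max-decode-length; otherwise it is a fixed cap.
    if length_basis == "input_lens":
        return len(source_tokens) + max_decode_length
    return max_decode_length

src = "this is a ten token example sentence for the demo".split()
print(max_target_length(src, length_basis="input_lens"))  # 60 tokens allowed
print(max_target_length(src))                             # hard cap of 50
```

A fixed cap of 50 silently truncates any translation longer than 50 tokens, which is why removing the flag hurts BLEU on longer sentences.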
Hi @dojoteef,
Today I tried using `--max-decode-length 50 --length-basis input_lens --order-output` and also only `--length-basis input_lens --order-output`, but the results are the same.
```
CUDA_VISIBLE_DEVICES=1 python main.py --dataset wmt_en_de_parsed --span 6 \
    --model parse_transformer -d raw/wmt -p preprocessed/wmt \
    --batch-size 1 --batch-method example --split test \
    --restore /tmp/synst/checkpoints/checkpoint.pt \
    --average-checkpoints 5 translate \
    --max-decode-length 50 --length-basis input_lens --order-output
```

```
# RESULT
Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de
BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.intl+version.1.4.1 = 18.5 52.2/24.6/13.3/7.5 (BP = 0.977 ratio = 0.977 hyp_len = 63219 ref_len = 64676)
```
I am also wondering in which cases I should use `--length-basis input_lens`. Is this parameter sensitive to the NMT dataset being used?
Hmm... so I'm not entirely certain what's going on. First we need to fix your reported negative log-likelihood (nll) value. It looks like you never specified a checkpoint to load; that's why your nll is so high. Try this command:
```
python main.py -b 5000 --dataset wmt_en_de_parsed --span 6 \
    --model parse_transformer -d raw/wmt -p preprocessed/wmt \
    --restore /tmp/synst/checkpoints/checkpoint.pt \
    --split valid -v evaluate
```
Doing so, I get the following output on my trained model:
```
Running torch 1.1.0
Examples=3000
Vocab Size=37686
Input Length=(min=1, avg=25, max=120)
Target Length=(min=1, avg=27, max=132)
Constituent Spans=(min=1.00, avg=1.90, max=6.00)
Loading checkpoint /tmp/synst/checkpoints/checkpoint.pt
Validate #20 nll=3.51(2.91): 58batch [00:09, 10.80batch/s]
```
Note how it states that it loaded a checkpoint. Try that and paste your results here.
@dojoteef Below is my output:
```
Running torch 1.1.0
Examples=3000
Vocab Size=37686
Input Length=(min=1, avg=25, max=120)
Target Length=(min=1, avg=27, max=132)
Constituent Spans=(min=1.00, avg=1.90, max=6.00)
Loading checkpoint /tmp/synst/checkpoints/checkpoint.pt
Validate #5 nll=3.65(3.28): 39batch [00:09, 6.55batch/s]
```
I guess the huge nll I reported yesterday was due to the `-v` flag, which kept the chunk identifiers in the generated file.
As for the BLEU score, I may need to retrain the model with more checkpoints. I just figured out that `--checkpoint-interval` stores checkpoints based on time, which means the performance of the first five checkpoints depends heavily on the machine. What do you think?
The huge nll you reported yesterday was due to not specifying a checkpoint to restore from (so the network weights were randomly initialized). The `-v` flag only outputs the chunk identifiers when doing translation (there are no chunk identifiers to output during evaluation).
Your higher nll of `3.28` vs `2.91` (and associated perplexity of `26.58` vs `18.37`) is likely the reason your BLEU scores are not similar to those reported in our paper.
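For reference, the perplexity is just the exponential of the average negative log-likelihood; a quick check on the numbers above:

```python
import math

# Perplexity = exp(average nll). exp(3.28) matches the thread's 26.58;
# exp(2.91) gives about 18.36, and the quoted 18.37 presumably comes
# from the unrounded nll.
for nll in (3.28, 2.91):
    print(nll, "->", round(math.exp(nll), 2))
```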
I just want to make sure that you were getting approximately 50k tokens per batch during training, so here are some questions for you:

- How many (and which) GPUs are you training on?
- What batch size are you using?
- What did you set `--accumulate` to?

Also, the checkpoints are saved every `--checkpoint-interval` seconds, so they are updated throughout training; the files you see are not simply the first five checkpoints.
`--accumulate` is set to 2 and the batch size is 3175 during training (on 2 P100 GPUs).
By the way, under `/tmp/synst/checkpoints/` I only found 5 checkpoints in total (`checkpoint.pt checkpoint1.pt checkpoint2.pt checkpoint3.pt checkpoint4.pt`). Is this normal?
It looks like your effective batch size is too small. In the README I mention that the training command assumes "you have access to 8 1080Ti GPUs". With the parameters I provided you have `--batch-size 3175` and `--accumulate 2`, resulting in:

```
8 GPUs * 3175 tokens/(GPU * batch) * 2 accumulated batches/optimizer update = 50800 tokens/optimizer update
```
Since you are only using 2 GPUs, you potentially have to increase the value of --accumulate
and since the P100 has more memory (16GB vs 11GB for a 1080Ti), you should be able to increase the batch size as well. Try modifying the values such that you are getting approximately 50k tokens per optimizer update.
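That arithmetic as a tiny helper (the 2-GPU values at the end are one illustrative combination, not a recommendation):

```python
# Hypothetical helper for the effective-batch-size arithmetic above.
def tokens_per_update(num_gpus, batch_size, accumulate):
    # Tokens consumed per optimizer update = GPUs x tokens/(GPU & batch)
    # x accumulated batches per update.
    return num_gpus * batch_size * accumulate

# README setting: 8 x 1080Ti.
print(tokens_per_update(8, 3175, 2))  # 50800

# One illustrative way to reach ~50k on 2 GPUs (not a recommendation):
print(tokens_per_update(2, 6350, 4))  # 50800
```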
Note that there are 5 checkpoints because the default value for `--max-checkpoints` is 5. Please use the `--help` option to see all the available options, and additionally look at `args.py`. The README is not meant to cover all the options in the codebase; it provides a small working example. Please look into the code a bit further if you are trying to adapt this for your particular use case.
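A rough sketch of that checkpointing policy (illustrative only, not the repo's actual code): saves are triggered by elapsed time, and only the newest files are kept.

```python
from collections import deque

# Illustrative sketch: a save happens every --checkpoint-interval
# seconds, and only the newest --max-checkpoints files are retained.
CHECKPOINT_INTERVAL = 1200  # seconds
MAX_CHECKPOINTS = 5

kept = deque(maxlen=MAX_CHECKPOINTS)
last_save = 0.0

def maybe_checkpoint(now, step):
    global last_save
    if now - last_save >= CHECKPOINT_INTERVAL:
        last_save = now
        kept.append(f"checkpoint{step}.pt")

# Simulate a save opportunity every 600 seconds of training time.
for step, now in enumerate(range(0, 12000, 600)):
    maybe_checkpoint(float(now), step)

# Only the 5 most recent checkpoints survive the rotation.
print(list(kept))
```

This is why the checkpoints you see rotate over the course of training rather than being the first five ever written.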
I hope that helps!
@dojoteef Thanks a lot for your detailed suggestions!
It took me a few days to finish training. I trained the `wmt_en_de_parsed` (span=6) model with 8 GPUs using the same parameters as the demo. Training triggered early stopping at epoch 20, and below is my result; it is the same as in the paper:

```
BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.intl+version.1.4.1 = 20.7 53.8/26.9/15.1/8.8 (BP = 0.990 ratio = 0.990 hyp_len = 64037 ref_len = 64676)
```
@dojoteef Thanks again for the help. I actually have the same concern as #5. Since the WMT dataset is quite large, could you share the parameters for the IWSLT dataset so that people can enjoy a quick evaluation?
That's great! The details of the IWSLT hyperparameters are in the paper. I highly recommend looking through the paper if you are going to build upon this work at all.
That said, here is a command line you could use to run on IWSLT (this was run on two GPUs):
```
python main.py -b 3000 --dataset iwslt_en_de_parsed --span 6 \
    --model parse_transformer --embedding-size 286 --hidden-dim 507 \
    --num-layers 5 --num-heads 2 -d raw/iwslt -p preprocessed/iwslt -v train \
    --learning-rate-scheduler linear --learning-rate 3e-4 --final-learning-rate 1e-5 \
    --checkpoint-interval 600 --label-smoothing 0
```
Hi, thank you for sharing such nice work along with the repo!
After finishing training, I couldn't find a way to compute the BLEU score of the model. Would you mind sharing example code for doing so? Thank you.