dojoteef / synst

Source code to reproduce the results in the ACL 2019 paper "Syntactically Supervised Transformers for Faster Neural Machine Translation"
BSD 3-Clause "New" or "Revised" License

How to evaluate with BLEU score #3

Closed jzhoubu closed 5 years ago

jzhoubu commented 5 years ago

Hi, thank you for sharing such nice work along with the repo!

After finishing training, I can't find a way to compute the BLEU score of the model. Would you mind sharing some example code for doing so? Thank you.

dojoteef commented 5 years ago

All numbers reported in our paper make use of SacreBLEU, which you can install using pip3 install sacrebleu. You also need a copy of Moses.
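For example, one way to set both up (the clone location here is just an example; point MOSESDECODER at wherever you put Moses):

pip3 install sacrebleu
git clone https://github.com/moses-smt/mosesdecoder.git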

I've made a couple of handy scripts for computing BLEU in the scripts folder. If, for example, you are trying to compute BLEU for EN-FR on WMT, then issue this command:

MOSESDECODER="<PATH_TO_MOSES>" scripts/sacrebleu-wrapper.sh "<OUTPUT_FILENAME>" en fr wmt14/full

Obviously you need to fill in the correct values for <PATH_TO_MOSES> and <OUTPUT_FILENAME>.

Here is the equivalent command using multi-bleu.perl instead, which a lot of people use to report their numbers but shouldn't (the script even prints a warning to that effect when you run it):

MOSESDECODER="<PATH_TO_MOSES>" scripts/multi-bleu-wrapper.sh "<OUTPUT_FILENAME>" "<PROJECT_PATH>/data/wmt_enfr/test.tok.fr"

Once again, make sure <PROJECT_PATH>/data/wmt_enfr/test.tok.fr points to the appropriate test file to compute the BLEU score against.
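As a sanity check independent of the wrapper scripts, you can also score a detokenized output file directly with the sacrebleu command-line tool (a hypothetical invocation; it assumes your output has already been detokenized, e.g. the .detok file the wrapper produces, and that you want the same wmt14/full references):

cat "<OUTPUT_FILENAME>.detok" | sacrebleu -t wmt14/full -l en-fr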

I hope that helps!

jzhoubu commented 5 years ago

Thanks for replying. I tried to compute the BLEU score by running

MOSESDECODER="../mosesdecoder" scripts/sacrebleu-wrapper.sh "/tmp/synst/output/translated_100000.txt" en de wmt14/full

and got the result shown below

Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de
BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.intl+version.1.4.1 = 1.6 15.8/2.6/0.7/0.2 (BP = 1.000 ratio = 3.265 hyp_len = 211193 ref_len = 64676)

I am wondering whether I made a mistake during evaluation or whether my model hasn't converged.

dojoteef commented 5 years ago

Can you provide the exact command line you used for each of the steps? Maybe I can spot an error. For example, I would expect there to be 5 checkpoints in /tmp/synst/checkpoints at the end of training.

Have you tried evaluating the perplexity? If so, what was it? Have you looked at the translations in translated_100000.txt? If so, do they look reasonable when you compare them to the ground truth reference? If not, what kinds of issues do they have?

jzhoubu commented 5 years ago

@dojoteef

  1. Sorry, I didn't record all the commands I executed, but I remember the places I changed:

     a. For preprocessing, as I mentioned in #2:

     LANG=en_US.UTF-8 LC_ALL= CLASSPATH=./stanford-corenlp-full-2018-10-05/* python main.py \
       --dataset wmt_en_de_parsed --span 6 -d raw/wmt -p preprocessed/wmt -v pass

     b. For generation, I removed --length-basis input_lens --order-output:

     CUDA_VISIBLE_DEVICES=0 python main.py --dataset wmt_en_de_parsed --span 6 \
       --model parse_transformer -d raw/wmt -p preprocessed/wmt \
       --batch-size 1 --batch-method example --split test -v \
       --restore /tmp/synst/checkpoints/checkpoint.pt \
       --average-checkpoints 5 translate \
       --max-decode-length 50


  2. I also evaluated the perplexity by executing:

python main.py -b 5000 --dataset wmt_en_de_parsed --span 6 \
  --model parse_transformer -d raw/wmt -p preprocessed/wmt \
  --split valid --disable-cuda -v evaluate

The result is:

Running torch 1.1.0
Examples=3000
Vocab Size=37686
Input Length=(min=1, avg=25, max=120)
Target Length=(min=1, avg=27, max=132)
Constituent Spans=(min=1.00, avg=1.90, max=6.00)
Validate #0 nll=26.17(26.36): 58batch [03:19, 3.66batch/s]


  3. Here are some samples I copied from `translated_100000.txt` and `translated_100000.txt.detok`:
From `translated_100000.txt`:

```
Gutach <$.1> : Gutach <$.1> : Erhöhung der Sicherheit für Fußgänger Sie sind nicht einmal 100 Meter entfernt <$.1> : Am Dienstag wurden die neuen Fußgängerbahnen in dem Dorfparkplatz in Gutach angesichts der bestehenden Rathalampel aktiv <$.1> . Zwei Lichtvon Lichten so nahe <$.1> : vorsichtlich oder nur töfalsch <$.1> ? Gestern hat der Bürgermeister von Gutacht eine klare Frage beantwortet <$.1> . " Damals wurden die Stadtaus aus Rathaus installiert <$,1> , weil dies eine Schulweg “ <$,1> , erläutert Eckert gestern <$.1> . Die Kluser Leuchter schützt Radfahrer <$,1> , ebenso wie diejenigen <$,1> , die mit dem Bus und den Einwohnern von Bergle reisen <$.1> .
```

From `translated_100000.txt.detok`:

```
Gutach <$.1>: Gutach <$.1>: Erhöhung der Sicherheit für Fußgänger Sie sind nicht einmal 100 Meter entfernt <$.1>: Am Dienstag wurden die neuen Fußgängerbahnen in dem Dorfparkplatz in Gutach angesichts der bestehenden Rathalampel aktiv <$.1>. Zwei Lichtvon Lichten so nahe <$.1>: vorsichtlich oder nur töfalsch <$.1>? Gestern hat der Bürgermeister von Gutacht eine klare Frage beantwortet <$.1>. " Damals wurden die Stadtaus aus Rathaus installiert <$,1>, weil dies eine Schulweg“ <$,1>, erläutert Eckert gestern <$.1>. Die Kluser Leuchter schützt Radfahrer <$,1>, ebenso wie diejenigen <$,1>, die mit dem Bus und den Einwohnern von Bergle reisen <$.1>.
```

dojoteef commented 5 years ago

Thanks that helps! There are two issues with the generation:

  1. You must specify --order-output, otherwise it will potentially output the translations in a random order, which obviously won't work for calculating BLEU.
  2. It seems I left a -v in the generation command in the README; I need to remove that. It causes the predicted chunk sequences to be output along with the translation for debugging purposes (see the sketch below for one way to strip them from an already-generated file).
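If you don't want to re-run translation just to drop those markers, a quick sed pass over the existing output should also work (a hypothetical one-liner; it assumes every chunk marker matches the <$...> pattern visible in your samples, and the .stripped.txt name is just an example):

# remove inline chunk markers such as <$.1> or <$,1> along with the trailing space
sed -E 's/<\$[^>]*> ?//g' /tmp/synst/output/translated_100000.txt > /tmp/synst/output/translated_100000.stripped.txt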

Try fixing those two things and let me know if it works!

jzhoubu commented 5 years ago

After fixing the two things you pointed out, it works! Below is the result:

BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.intl+version.1.4.1 = 18.5 52.3/24.7/13.3/7.5 (BP = 0.976 ratio = 0.976 hyp_len = 63130 ref_len = 64676)
dojoteef commented 5 years ago

The reason you are getting such a low BLEU score is likely because you removed --length-basis input_lens. That means it will stop decoding after 50 tokens, rather than 50 tokens + length of the source sentence (which is what the original Transformer paper does).

Try adding that back in to see if you get a higher BLEU score.
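One quick way to check whether truncation is actually hurting you is to compare the average hypothesis and reference lengths (a hypothetical check using standard tools; substitute the tokenized reference file you are scoring against):

# average tokens per line: hypothesis vs. reference
awk '{ total += NF } END { print total / NR }' /tmp/synst/output/translated_100000.txt
awk '{ total += NF } END { print total / NR }' "<PATH_TO_TOKENIZED_REFERENCE>"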

jzhoubu commented 5 years ago

Hi @dojoteef, today I tried using --max-decode-length 50 --length-basis input_lens --order-output and also just --length-basis input_lens --order-output, but the results are the same.

CUDA_VISIBLE_DEVICES=1 python main.py --dataset wmt_en_de_parsed --span 6 \
  --model parse_transformer -d raw/wmt -p preprocessed/wmt \
  --batch-size 1 --batch-method example --split test \
  --restore /tmp/synst/checkpoints/checkpoint.pt \
  --average-checkpoints 5 translate \
  --max-decode-length 50 --length-basis input_lens --order-output

# RESULT
Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de
BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.intl+version.1.4.1 = 18.5 52.2/24.6/13.3/7.5 (BP = 0.977 ratio = 0.977 hyp_len = 63219 ref_len = 64676)

I am also wondering in which cases I should use --length-basis input_lens. Is this parameter sensitive to the particular NMT dataset?

dojoteef commented 5 years ago

Hmm... so I'm not entirely certain what's going on. First we need to fix your reported negative log-likelihood (nll) value.

It looks like you never specified a checkpoint to load. That's why your nll is so high. Try this command:

python main.py -b 5000 --dataset wmt_en_de_parsed --span 6 \
  --model parse_transformer -d raw/wmt -p preprocessed/wmt \
  --restore /tmp/synst/checkpoints/checkpoint.pt \
  --split valid -v evaluate

Doing so, I get the following output on my trained model:

Running torch 1.1.0
Examples=3000
Vocab Size=37686
Input Length=(min=1, avg=25, max=120)
Target Length=(min=1, avg=27, max=132)
Constituent Spans=(min=1.00, avg=1.90, max=6.00)
Loading checkpoint /tmp/synst/checkpoints/checkpoint.pt
Validate #20 nll=3.51(2.91): 58batch [00:09, 10.80batch/s]

Note how it states that it loaded a checkpoint. Try that and paste your results here.

jzhoubu commented 5 years ago

@dojoteef Below is my output:

Running torch 1.1.0
Examples=3000
Vocab Size=37686
Input Length=(min=1, avg=25, max=120)
Target Length=(min=1, avg=27, max=132)
Constituent Spans=(min=1.00, avg=1.90, max=6.00)
Loading checkpoint /tmp/synst/checkpoints/checkpoint.pt
Validate #5 nll=3.65(3.28): 39batch [00:09,  6.55batch/s]
jzhoubu commented 5 years ago

I guess the huge nll I reported yesterday was due to the -v, which kept the chunk identifiers in the generated file.

As for the BLEU score, I may need to retrain the model with more checkpoints. I just figured out that --checkpoint-interval stores checkpoints based on time, which means the performance of the first five checkpoints depends heavily on the machine. What do you think?

dojoteef commented 5 years ago

The huge nll you reported yesterday was due to not specifying a checkpoint to restore from (so it had randomly initialized network weights). The -v only outputs the chunk identifiers when doing translation (there are no chunk identifiers to output during evaluation).

Your higher nll of 3.28 vs 2.91 (and the associated perplexity of 26.58 vs 18.37) is likely the reason your BLEU scores are not similar to those reported in our paper.
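(For reference, those perplexities are just the exponentiated nll values, assuming the reported nll is a per-token average:)

python -c "import math; print(math.exp(3.28))"   # ~26.6
python -c "import math; print(math.exp(2.91))"   # ~18.4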

I just want to make sure that you were getting approximately 50k tokens per batch during training. So here are some questions for you:

  1. How many GPUs did you train on? (And what type of GPUs were they)
  2. What did you set the value for --accumulate to?
  3. What batch size did you specify?

Also, the checkpoints are saved every --checkpoint-interval seconds, so they are updated throughout training; they are not simply the first five checkpoints from the start of training.

jzhoubu commented 5 years ago
  1. I trained on 2 * Tesla P100 (16GB)
  2. I followed the README example, so --accumulate was set to 2 and the batch size was 3175 during training

By the way, under /tmp/synst/checkpoints/ I only found 5 checkpoints in total (checkpoint.pt, checkpoint1.pt, checkpoint2.pt, checkpoint3.pt, checkpoint4.pt). Is this normal?

dojoteef commented 5 years ago

It looks like your effective batch size is too small. In the README I mention that the training command assumes "you have access to 8 1080Ti GPUs...". With the parameters I provided you have --batch-size 3175 and --accumulate 2, resulting in:

8GPUs * 3175 Tokens/(GPU & Batch) * 2 Accumulated Batches/Optimizer Update = 50800 Tokens/Optimizer Update

Since you are only using 2 GPUs, you will need to increase the value of --accumulate, and since the P100 has more memory (16GB vs 11GB for a 1080Ti), you should be able to increase the batch size as well. Try modifying the values such that you get approximately 50k tokens per optimizer update.
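For example (hypothetical settings; whether a larger batch actually fits in P100 memory is an assumption you would need to verify):

# effective tokens per optimizer update = GPUs x tokens per GPU-batch x accumulated batches
python -c "print(8 * 3175 * 2)"   # 50800 -- the README setting on 8x 1080Ti GPUs
python -c "print(2 * 3175 * 8)"   # 50800 -- same total on 2 GPUs by raising --accumulate to 8
python -c "print(2 * 6350 * 4)"   # 50800 -- or a doubled per-GPU batch with --accumulate 4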

Note that there are 5 checkpoints because the default value for --max-checkpoints is 5. Please use the --help option to see all the available options, and additionally look at args.py. The README is not meant to cover all the options in the codebase; it provides a small working example. Please look into the code a bit further if you are trying to adapt this for your particular use case.
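For instance, to see everything that is configurable:

python main.py --help
# or browse the argument definitions directly
less args.py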

I hope that helps!

jzhoubu commented 5 years ago

@dojoteef Thanks a lot for your detailed suggestions!

jzhoubu commented 5 years ago

It took me a few days to finish training. I trained the wmt_en_de_parsed (span=6) model on 8 GPUs using the same parameters as the demo. Training triggered early stopping at epoch 20, and below is my result; it is the same as in the paper:

BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.intl+version.1.4.1 = 20.7 53.8/26.9/15.1/8.8 (BP = 0.990 ratio = 0.990 hyp_len = 64037 ref_len = 64676)

@dojoteef Thanks again for the help. I actually have the same concern as #5. As the WMT dataset is quite large, could you share the parameters for the IWSLT dataset so that people can enjoy a quick evaluation?

dojoteef commented 5 years ago

That's great! The details of the IWSLT hyperparameters are in the paper. I highly recommend looking through the paper if you are going to build upon this work at all.

That said, here is a command line you could use to run on IWSLT (this was run on two GPUs):

python main.py -b 3000 --dataset iwslt_en_de_parsed --span 6 \
  --model parse_transformer --embedding-size 286 --hidden-dim 507 \
  --num-layers 5 --num-heads 2 -d raw/iwslt -p preprocessed/iwslt -v train \
  --learning-rate-scheduler linear --learning-rate 3e-4 --final-learning-rate 1e-5 \
  --checkpoint-interval 600 --label-smoothing 0