L-Zhe / BTmPG

Code for the paper "Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach" by Zhe Lin and Xiaojun Wan, accepted to Findings of ACL 2021.
MIT License

Lower performance with retrained model #2

Open tomhosking opened 2 years ago

tomhosking commented 2 years ago

When I use a checkpoint that I've trained from scratch instead of the checkpoint downloaded from here, performance is ~2 iBLEU lower. The command used to train the model was:

python train.py --cuda \
                --train_source ./data/qqp_train.src \
                --train_target ./data/qqp_train.tgt \
                --test_source  ./data/qqp_dev.src \
                --test_target  ./data/qqp_dev.tgt \
                --vocab_path ./checkpoints/qqp.vocab \
                --batch_size 8 \
                --epoch 100 \
                --num_rounds 2 \
                --max_length 50 \
                --clip_length 50 \
                --model_save_path ./checkpoints/qqp.model \
                --generation_save_path ./outputs/qqp/

Are there additional hyperparameters that I need to set?

L-Zhe commented 2 years ago

We do not employ iBLEU to evaluate our model, so I think you may have chosen the wrong evaluation metric.

tomhosking commented 2 years ago

Thanks for your response - iBLEU is just a weighted difference between BLEU and self-BLEU, which you do report in the paper. I get the following scores on MSCOCO when I train your model from scratch (using the command above), after 10 rounds: BLEU: 18.13, self-BLEU: 11.22. Compare that to the results when I do the same with the checkpoint you've provided: BLEU: 21.30, self-BLEU: 13.84. This is much closer to the result from your paper (there will be a small difference since I'm not using exactly the same split).
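
For concreteness, iBLEU as defined by Sun & Zhou (2012) combines corpus BLEU against the references with self-BLEU against the inputs. A minimal sketch, assuming sacrebleu and alpha=0.8 (neither is part of this repo, which reports BLEU and self-BLEU separately):

from sacrebleu.metrics import BLEU

def ibleu(hypotheses, references, sources, alpha=0.8):
    # iBLEU = alpha * BLEU(hyp, ref) - (1 - alpha) * BLEU(hyp, src)
    bleu = BLEU()
    ref_bleu = bleu.corpus_score(hypotheses, [references]).score  # BLEU vs. references
    self_bleu = bleu.corpus_score(hypotheses, [sources]).score    # self-BLEU vs. inputs
    return alpha * ref_bleu - (1 - alpha) * self_bleu

hyps = ["how can i learn french quickly ?"]
refs = ["what is the fastest way to learn french ?"]
srcs = ["how do i learn french fast ?"]
print(ibleu(hyps, refs, srcs))

With alpha=0.8, the scores quoted above work out to roughly 12.3 for the retrained model and 14.3 for the released checkpoint, consistent with the ~2-point gap.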

I'm trying to train your model on another dataset (one you don't use in your paper), and the performance is currently much worse than the other comparison systems. So I wanted to check that I was training in the correct way, to make a fair comparison - please let me know if I should be doing anything differently.

L-Zhe commented 2 years ago

I cannot diagnose your problem, as I do not have your dataset. But I notice that your batch size is too small, so I suggest you increase it.
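
If GPU memory is what forces the small --batch_size, gradient accumulation is a generic way to raise the effective batch size without extra memory. A minimal PyTorch sketch (the model and data here are placeholders, not code from this repo):

import torch

ACCUM_STEPS = 4  # e.g. --batch_size 8 * 4 accumulation steps = effective 32
model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(100)]  # dummy batches

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / ACCUM_STEPS).backward()  # scale so gradients average over the steps
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()       # update once per effective batch
        optimizer.zero_grad()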

L-Zhe commented 2 years ago

Alternatively, you can try disabling the diversity coefficient at line 85 of utils/run.py. It is used to address the lack of diversity in the first generated word.
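
I could not verify what line 85 of utils/run.py contains, but a first-step diversity penalty typically down-weights first tokens that earlier beams or rounds already picked. A hypothetical sketch (diversity_coef and the penalty scheme are assumptions, not this repo's actual code):

import torch

diversity_coef = 1.0  # setting this to 0 disables the penalty, as suggested above

def first_step_scores(log_probs, used_first_tokens):
    # log_probs: (vocab_size,) log-probabilities for the first decoding step
    penalized = log_probs.clone()
    for tok in used_first_tokens:
        penalized[tok] -= diversity_coef  # discourage reusing these first words
    return penalized

log_probs = torch.log_softmax(torch.randn(32000), dim=-1)
scores = first_step_scores(log_probs, used_first_tokens=[42, 7])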

tomhosking commented 2 years ago

Thanks - I will try reducing the length limit and increasing the batch size.

tomhosking commented 2 years ago

I've been able to train to completion using a batch size of 32 - but I now get BLEU and Self-BLEU scores of 0. It looks like training is stable at the start, but validation scores go to 0 about halfway through. Does the training script not use early stopping? How should I pick the number of training epochs?
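
The script does not appear to implement early stopping, so a standard pattern is to checkpoint on the best validation score and stop once it stops improving. A generic sketch (train_epoch, evaluate_bleu, and save_checkpoint are hypothetical stand-ins for this repo's training, evaluation, and checkpointing code):

import random

def train_epoch():
    pass  # one pass over the training data

def evaluate_bleu():
    return random.random()  # placeholder for real validation BLEU

def save_checkpoint():
    pass  # persist the best-so-far weights

PATIENCE = 5  # epochs allowed without improvement
best_bleu, stale_epochs = 0.0, 0
for epoch in range(100):
    train_epoch()
    bleu = evaluate_bleu()
    if bleu > best_bleu:
        best_bleu, stale_epochs = bleu, 0
        save_checkpoint()  # only keep the best-scoring checkpoint
    else:
        stale_epochs += 1
        if stale_epochs >= PATIENCE:
            break  # validation plateaued; stop early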