facebookresearch / Mask-Predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

Cannot reproduce the result on WMT14 En-De #6

Open shawnkx opened 4 years ago

shawnkx commented 4 years ago

Hi,

Thanks so much for releasing your models and data. However, after running the following command, I could only get 9.75 BLEU-4 on WMT14 En-De:

python generate_cmlm.py data-bin/wmt14.en-de/ --path models/wmt14-ende/maskPredict_en_de/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict

Any idea where I went wrong? Thanks!

yinhanliu commented 4 years ago

It looks like your dictionary is messed up. Can you double-check that the binarized data in your data-bin directory was built with the dictionary we released?
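
If it helps, here is a minimal re-binarization sketch (my assumptions: this repo's fairseq-style preprocess.py, a folder of BPE-tokenized train/valid/test text, and the dict.en.txt / dict.de.txt files shipped with the released model; paths are illustrative and need adjusting to your setup):

# rebuild data-bin with the released dictionaries so the vocab ids match the checkpoint
text=wmt14_ende_tokenized            # hypothetical folder with BPE'd train/valid/test files
dict_dir=models/wmt14-ende/maskPredict_en_de
python preprocess.py --source-lang en --target-lang de \
  --trainpref $text/train --validpref $text/valid --testpref $text/test \
  --srcdict $dict_dir/dict.en.txt --tgtdict $dict_dir/dict.de.txt \
  --destdir data-bin/wmt14.en-de --workers 16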

slih-sh commented 4 years ago

I have a similar issue. The result can be reproduced with your released model, but I get a very low BLEU score with the model I trained myself using your code. Is there something I may have missed?

yinhanliu commented 4 years ago

@slih-sh have you seen this https://github.com/facebookresearch/Mask-Predict/issues/3#issuecomment-550131786?

yinhanliu commented 4 years ago

@slih-sh did you preprocess your train/dev/test data all at the same time? And what was the validation perplexity (ppl) of your trained model?

slih-sh commented 4 years ago

@yinhanliu Thank you for your reply. I used the script get_data.sh for preprocessing; could this affect the result?

yinhanliu commented 4 years ago

No, it should not.

slih-sh commented 4 years ago

@yinhanliu I trained the model on 8 GPUs and set --update-freq to 2. I got a 23.13 BLEU score with 10 decoding iterations. There is still a gap between 23.13 and 24.61; what could be the reason?

yinhanliu commented 4 years ago

What distilled data did you use?

slih-sh commented 4 years ago

I didn't use distilled data, because 24.61 is the score with raw data and the score with knowledge distillation is 27.03, right?

yinhanliu commented 4 years ago

I see your setup now. How long did you train the model (up to 300K updates?) and what was your --max-tokens? If you trained on 16GB GPUs (4096 max tokens), you have to double your --update-freq.
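
As a sanity check on the batch-size math (assuming the effective batch is max-tokens * GPUs * update-freq, with the 8192 * 16 setup mentioned later in this thread as the reference):

# reference setup: 8192 max-tokens on 16 GPUs, update-freq 1
echo $(( 8192 * 16 * 1 ))   # 131072 tokens per update
# 8 GPUs at 8192 max-tokens need update-freq 2 to match
echo $(( 8192 * 8 * 2 ))    # 131072
# 8 GPUs at 4096 max-tokens (16GB cards) need update-freq 4
echo $(( 4096 * 8 * 4 ))    # 131072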

slih-sh commented 4 years ago

The max-tokens is 8192 (16GB GPU) and I trained the model up to 300K updates; the score is now 23.82. But there is still a 0.79 gap, any idea?

yinhanliu commented 4 years ago

One thing for sure: if you train longer, the score can get better.

Based on this information, you probably have some optimization parameters that differ from ours.

alphadl commented 4 years ago

> @slih-sh did you preprocess your train/dev/test data all at the same time? And what was the validation perplexity (ppl) of your trained model?

My current model's performance on the valid set is:

valid on 'valid' subset | loss 3.263 | nll_loss 1.138 | ppl 2.20 | best_loss 3.24604 | length_loss 3.40198

However, the BLEU score on the test set is merely 24.8. Do you have any suggestions for improving performance? @yinhanliu

yinhanliu commented 4 years ago

@alphadl can you give me more info about what you are training on? Your batch size in terms of number of tokens, the number of GPUs, and your dataset -- distilled or original data?

alphadl commented 4 years ago

Distilled data. Batch size is 8192 tokens * 4 GPUs with --update-freq 4.

alphadl commented 4 years ago

BTW, I suggest adding the post-processing script compound_split_bleu.sh to the evaluation phase, so that we can reproduce the BLEU score reported in your paper.
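
For anyone else scoring their output, a rough sketch of that post-processing step (modeled on the standard fairseq scripts/compound_split_bleu.sh rather than anything taken from this repo, and assuming generate_cmlm.py writes fairseq-style H-/T- lines):

gen=gen.out   # output of generate_cmlm.py redirected to a file
# split hyphenated compounds in hypotheses and references before scoring
grep ^H $gen | cut -f3- | perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' > $gen.sys
grep ^T $gen | cut -f2- | perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' > $gen.ref
python score.py --sys $gen.sys --ref $gen.ref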

yinhanliu commented 4 years ago

Parameters look good to me. We usually train the model for 1.5 days, and in your case (update-freq 4) it might take more than 3 days. Also, can you share how you generated your distilled data?

alphadl commented 4 years ago

Thanks for your prompt reply. The distilled data is derived from the strong AT model Scaling NMT.
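
For context, sequence-level distillation is usually done roughly like this (a sketch under stated assumptions, not necessarily the exact recipe used here; the checkpoint path and the fairseq-style generate.py flags are illustrative): decode the full training source with the AT teacher, use its beam outputs as the new targets, then re-binarize with the released dictionaries as above.

# decode the training set with the autoregressive teacher
python generate.py data-bin/wmt14.en-de --path scaling_nmt/checkpoint.pt \
  --gen-subset train --beam 5 > distill.out
# re-pair sources with teacher hypotheses to form the distilled parallel corpus
grep ^S distill.out | cut -f2- > distill/train.en   # sources
grep ^H distill.out | cut -f3- > distill/train.de   # teacher beam outputs as targets
# then run preprocess.py on distill/ and train the CMLM on the result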

yeliu918 commented 4 years ago

Hi,

I also get only 9.75 BLEU-4 on wmt14.en_de (far less than the paper reports) but 30.11 BLEU-4 on wmt14.en_de (almost the same as in the paper) using your released trained model. I used get_data.sh to process the data and I checked that the vocabulary is the same as in your released MaskPredict/checkpoint_best.pt.

What confuses me is that in training I used the same hyperparameters as you released:

data-bin/wmt14.en-de --arch bert_transformer_seq2seq --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation_self --max-tokens 8192 --weight-decay 0.01 --dropout 0.3 --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 0 --save-dir output_de

but the loss does not go down. I trained 17 epochs on wmt14.en_de:

epoch 001 | valid on 'valid' subset | loss 12.210 | nll_loss 11.213 | ppl 2373.37 | num_updates 18681 | length_loss 6.73082
epoch 002 | valid on 'valid' subset | loss 12.204 | nll_loss 11.180 | ppl 2319.64 | num_updates 37373 | best_loss 12.2038 | length_loss 6.72529
epoch 003 | valid on 'valid' subset | loss 12.248 | nll_loss 11.243 | ppl 2422.95 | num_updates 56065 | best_loss 12.2038 | length_loss 6.72258
epoch 004 | valid on 'valid' subset | loss 12.445 | nll_loss 11.358 | ppl 2625.23 | num_updates 74754 | best_loss 12.2038 | length_loss 7.02183
epoch 005 | valid on 'valid' subset | loss 12.385 | nll_loss 11.307 | ppl 2534.36 | num_updates 93446 | best_loss 12.2038 | length_loss 7.2854
epoch 006 | valid on 'valid' subset | loss 12.270 | nll_loss 11.282 | ppl 2490.07 | num_updates 112138 | best_loss 12.2038 | length_loss 7.59447
epoch 008 | valid on 'valid' subset | loss 12.769 | nll_loss 11.786 | ppl 3531.21 | num_updates 149522 | best_loss 12.2038 | length_loss 7.25636
epoch 010 | valid on 'valid' subset | loss 13.763 | nll_loss 12.413 | ppl 5452.60 | num_updates 186905 | best_loss 12.2038 | length_loss 11.7421
epoch 011 | valid on 'valid' subset | loss 12.597 | nll_loss 11.620 | ppl 3148.15 | num_updates 205596 | best_loss 12.2038 | length_loss 6.6838
epoch 012 | valid on 'valid' subset | loss 12.768 | nll_loss 11.813 | ppl 3596.91 | num_updates 224288 | best_loss 12.2038 | length_loss 6.70741
epoch 013 | valid on 'valid' subset | loss 12.453 | nll_loss 11.422 | ppl 2743.58 | num_updates 242978 | best_loss 12.2038 | length_loss 6.69415
epoch 014 | valid on 'valid' subset | loss 13.029 | nll_loss 12.029 | ppl 4178.98 | num_updates 261669 | best_loss 12.2038 | length_loss 6.69363
epoch 015 | valid on 'valid' subset | loss 12.974 | nll_loss 11.896 | ppl 3810.69 | num_updates 280361 | best_loss 12.2038 | length_loss 6.68126
epoch 016 | valid on 'valid' subset | loss 12.752 | nll_loss 11.766 | ppl 3483.55 | num_updates 299052 | best_loss 12.2038 | length_loss 6.99781
epoch 017 | valid on 'valid' subset | loss 13.223 | nll_loss 12.223 | ppl 4779.39 | num_updates 300000 | best_loss 12.2038 | length_loss 7.00145

In generation with checkpoint_best.pt, the output is mostly just commas (",,,,,").

jungokasai commented 4 years ago

I have some follow-up on this: see Figure 3 in https://arxiv.org/pdf/2001.05136.pdf. It looks like the CMLM's performance deteriorates by more than 0.5 BLEU points when the overall batch size is halved from the original --max-tokens 8192 --distributed-world-size 16. Because of the nature of CMLM training, where only the masked tokens are predicted, a bigger batch size is required to perform well. Also, it did not converge with --max-tokens 8192 --distributed-world-size 1.
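
A practical takeaway (an assumption on my part, using only flags that already appear in this thread and the repo's fairseq-style train.py): with fewer GPUs, keep the effective batch near the original 8192 * 16 tokens per update by raising --update-freq rather than shrinking the batch, e.g. on a single 8-GPU machine:

python train.py data-bin/wmt14.en-de <same training arguments as above> \
  --max-tokens 8192 --distributed-world-size 8 --update-freq 2
# a single GPU would need --update-freq 16 to match, which trains very slowly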