facebookresearch / Mask-Predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

Cannot reproduce the result on WMT14 En-De #6

Open shawnkx opened 4 years ago

shawnkx commented 4 years ago

Hi,

Thanks so much for releasing your models and data. However, after running the following command, I could only get 9.75 BLEU-4 on WMT14 En-De:

python generate_cmlm.py data-bin/wmt14.en-de/ --path models/wmt14-ende/maskPredict_en_de/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict

Any idea where I went wrong? Thanks!

yinhanliu commented 4 years ago

It looks like your dictionary is messed up. Can you double-check that the binarized data in your data-bin directory was built with the dictionary we released?
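
If it helps, here is a minimal re-binarization sketch (my assumptions: this repo's fairseq-style preprocess.py, a folder of BPE-tokenized train/valid/test text, and the dict.en.txt / dict.de.txt files shipped with the released model; paths are illustrative and need adjusting to your setup):

# rebuild data-bin with the released dictionaries so the vocab ids match the checkpoint
text=wmt14_ende_tokenized            # hypothetical folder with BPE'd train/valid/test files
dict_dir=models/wmt14-ende/maskPredict_en_de
python preprocess.py --source-lang en --target-lang de \
  --trainpref $text/train --validpref $text/valid --testpref $text/test \
  --srcdict $dict_dir/dict.en.txt --tgtdict $dict_dir/dict.de.txt \
  --destdir data-bin/wmt14.en-de --workers 16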

slih-sh commented 4 years ago

I have a similar issue. The result can be reproduced with your released model, but I get a very low BLEU score with the model I trained myself using your code. Is there something I may have missed?

yinhanliu commented 4 years ago

@slih-sh have you seen this https://github.com/facebookresearch/Mask-Predict/issues/3#issuecomment-550131786?

yinhanliu commented 4 years ago

@slih-sh did you preprocess your train/dev/test data all at the same time? And what was the validation perplexity (ppl) of your trained model?

slih-sh commented 4 years ago

@yinhanliu Thank you for your reply. I used the script get_data.sh for preprocessing; could this affect the result?

yinhanliu commented 4 years ago

No, it should not.

slih-sh commented 4 years ago

@yinhanliu I trained the model on 8 GPUs and set --update-freq to 2. I got a 23.13 BLEU score with 10 decoding iterations. There is still a gap between 23.13 and 24.61; what could be the reason?

yinhanliu commented 4 years ago

What distilled data did you use?

slih-sh commented 4 years ago

I didn't use distilled data, because 24.61 is the score with raw data and the score with knowledge distillation is 27.03, right?

yinhanliu commented 4 years ago

I see your setup now. How long did you train the model (up to 300K updates?) and what was your --max-tokens? If you trained on 16GB GPUs (4096 max tokens), you have to double your --update-freq.
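
As a sanity check on the batch-size math (assuming the effective batch is max-tokens * GPUs * update-freq, with the 8192 * 16 setup mentioned later in this thread as the reference):

# reference setup: 8192 max-tokens on 16 GPUs, update-freq 1
echo $(( 8192 * 16 * 1 ))   # 131072 tokens per update
# 8 GPUs at 8192 max-tokens need update-freq 2 to match
echo $(( 8192 * 8 * 2 ))    # 131072
# 8 GPUs at 4096 max-tokens (16GB cards) need update-freq 4
echo $(( 4096 * 8 * 4 ))    # 131072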

slih-sh commented 4 years ago

The max-tokens is 8192 (16GB GPU) and I trained the model up to 300K updates; the score is now 23.82. But there is still a 0.79 gap, any idea?

yinhanliu commented 4 years ago

One thing for sure: if you train longer, the score can get better.

Based on this information, you probably have some optimization parameters that differ from ours.

alphadl commented 4 years ago

> @slih-sh did you preprocess your train/dev/test data all at the same time? And what was the validation perplexity (ppl) of your trained model?

My current model's performance on the valid set is:

valid on 'valid' subset | loss 3.263 | nll_loss 1.138 | ppl 2.20 | best_loss 3.24604 | length_loss 3.40198

However, the BLEU score on the test set is merely 24.8. Do you have any suggestions for improving performance? @yinhanliu

yinhanliu commented 4 years ago

@alphadl can you give me more info about what you are training on? Your batch size in terms of number of tokens, the number of GPUs, and your dataset -- distilled or original data?

alphadl commented 4 years ago

Distilled data. Batch size is 8192 tokens * 4 GPUs with --update-freq 4.

alphadl commented 4 years ago

BTW, I suggest adding the post-processing script compound_split_bleu.sh to the evaluation phase, so that we can reproduce the BLEU score reported in your paper.
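
For anyone else scoring their output, a rough sketch of that post-processing step (modeled on the standard fairseq scripts/compound_split_bleu.sh rather than anything taken from this repo, and assuming generate_cmlm.py writes fairseq-style H-/T- lines):

gen=gen.out   # output of generate_cmlm.py redirected to a file
# split hyphenated compounds in hypotheses and references before scoring
grep ^H $gen | cut -f3- | perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' > $gen.sys
grep ^T $gen | cut -f2- | perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' > $gen.ref
python score.py --sys $gen.sys --ref $gen.ref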

yinhanliu commented 4 years ago

Parameters look good to me. We usually train the model for 1.5 days, and in your case (update-freq 4) it might take more than 3 days. Also, can you share how you generated your distilled data?

alphadl commented 4 years ago

Thanks for your prompt reply. The distilled data is derived from the strong AT model Scaling NMT.
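
For context, sequence-level distillation is usually done roughly like this (a sketch under stated assumptions, not necessarily the exact recipe used here; the checkpoint path and the fairseq-style generate.py flags are illustrative): decode the full training source with the AT teacher, use its beam outputs as the new targets, then re-binarize with the released dictionaries as above.

# decode the training set with the autoregressive teacher
python generate.py data-bin/wmt14.en-de --path scaling_nmt/checkpoint.pt \
  --gen-subset train --beam 5 > distill.out
# re-pair sources with teacher hypotheses to form the distilled parallel corpus
grep ^S distill.out | cut -f2- > distill/train.en   # sources
grep ^H distill.out | cut -f3- > distill/train.de   # teacher beam outputs as targets
# then run preprocess.py on distill/ and train the CMLM on the result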

yeliu918 commented 4 years ago

Hi,

I also get only 9.75 BLEU-4 on wmt14.en_de (far less than the paper reports) but 30.11 BLEU-4 on wmt14.en_de (almost the same as in the paper) using your released trained model. I used get_data.sh to process the data and I checked that the vocabulary is the same as in your released MaskPredict/checkpoint_best.pt.

What confuses me is that in training I used the same hyperparameters as you released:

data-bin/wmt14.en-de --arch bert_transformer_seq2seq --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation_self --max-tokens 8192 --weight-decay 0.01 --dropout 0.3 --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 --max-source-positions 10000 --max-target-positions 10000 --max-update 300000 --seed 0 --save-dir output_de

but the loss does not go down. I trained 17 epochs on wmt14.en_de:

epoch 001 | valid on 'valid' subset | loss 12.210 | nll_loss 11.213 | ppl 2373.37 | num_updates 18681 | length_loss 6.73082
epoch 002 | valid on 'valid' subset | loss 12.204 | nll_loss 11.180 | ppl 2319.64 | num_updates 37373 | best_loss 12.2038 | length_loss 6.72529
epoch 003 | valid on 'valid' subset | loss 12.248 | nll_loss 11.243 | ppl 2422.95 | num_updates 56065 | best_loss 12.2038 | length_loss 6.72258
epoch 004 | valid on 'valid' subset | loss 12.445 | nll_loss 11.358 | ppl 2625.23 | num_updates 74754 | best_loss 12.2038 | length_loss 7.02183
epoch 005 | valid on 'valid' subset | loss 12.385 | nll_loss 11.307 | ppl 2534.36 | num_updates 93446 | best_loss 12.2038 | length_loss 7.2854
epoch 006 | valid on 'valid' subset | loss 12.270 | nll_loss 11.282 | ppl 2490.07 | num_updates 112138 | best_loss 12.2038 | length_loss 7.59447
epoch 008 | valid on 'valid' subset | loss 12.769 | nll_loss 11.786 | ppl 3531.21 | num_updates 149522 | best_loss 12.2038 | length_loss 7.25636
epoch 010 | valid on 'valid' subset | loss 13.763 | nll_loss 12.413 | ppl 5452.60 | num_updates 186905 | best_loss 12.2038 | length_loss 11.7421
epoch 011 | valid on 'valid' subset | loss 12.597 | nll_loss 11.620 | ppl 3148.15 | num_updates 205596 | best_loss 12.2038 | length_loss 6.6838
epoch 012 | valid on 'valid' subset | loss 12.768 | nll_loss 11.813 | ppl 3596.91 | num_updates 224288 | best_loss 12.2038 | length_loss 6.70741
epoch 013 | valid on 'valid' subset | loss 12.453 | nll_loss 11.422 | ppl 2743.58 | num_updates 242978 | best_loss 12.2038 | length_loss 6.69415
epoch 014 | valid on 'valid' subset | loss 13.029 | nll_loss 12.029 | ppl 4178.98 | num_updates 261669 | best_loss 12.2038 | length_loss 6.69363
epoch 015 | valid on 'valid' subset | loss 12.974 | nll_loss 11.896 | ppl 3810.69 | num_updates 280361 | best_loss 12.2038 | length_loss 6.68126
epoch 016 | valid on 'valid' subset | loss 12.752 | nll_loss 11.766 | ppl 3483.55 | num_updates 299052 | best_loss 12.2038 | length_loss 6.99781
epoch 017 | valid on 'valid' subset | loss 13.223 | nll_loss 12.223 | ppl 4779.39 | num_updates 300000 | best_loss 12.2038 | length_loss 7.00145

In generation with checkpoint_best.pt, the output is mostly just commas (",,,,,").

jungokasai commented 4 years ago

I have some follow-up on this: see Figure 3 in https://arxiv.org/pdf/2001.05136.pdf. It looks like the CMLM's performance deteriorates by more than 0.5 BLEU points when the overall batch size is halved from the original --max-tokens 8192 --distributed-world-size 16. Because of the nature of CMLM training, where only the masked tokens are predicted, a bigger batch size is required to perform well. Also, it did not converge with --max-tokens 8192 --distributed-world-size 1.
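
A practical takeaway (an assumption on my part, using only flags that already appear in this thread and the repo's fairseq-style train.py): with fewer GPUs, keep the effective batch near the original 8192 * 16 tokens per update by raising --update-freq rather than shrinking the batch, e.g. on a single 8-GPU machine:

python train.py data-bin/wmt14.en-de <same training arguments as above> \
  --max-tokens 8192 --distributed-world-size 8 --update-freq 2
# a single GPU would need --update-freq 16 to match, which trains very slowly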