facebookresearch / Mask-Predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

Will you release the distillation dataset of wmt-en-de? #3

Closed SunbowLiu closed 5 years ago

SunbowLiu commented 5 years ago

Hi,

I have successfully reproduced the 27.03 BLEU score (N=10, l=5) and 1.2 times speedup (N=10, l=2) using your pre-trained wmt-en-de model.

I want to train the model from scratch, but the performance heavily relies on the distillation dataset you used (with the raw data, I can only reach a ~24 BLEU score), so it would be much better if you could provide this dataset.

Thank you!

ftakanashi commented 5 years ago


Hello there. I tried to reproduce the en-de results in the paper, but I could only get a BLEU score of about 22.6. Could you share some details of your reproduction, such as which dataset you used and what the other hyperparameters were? Any information would be very helpful. Thanks!

SunbowLiu commented 5 years ago


Hi,

All hyperparameters are the same as in the paper and the provided script. The dataset is https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8. The authors use a batch size of 16*8192 tokens, so if you have only 8 V100s, you should set --update-freq to 2. These settings make it possible to train a model with a ~24 BLEU score, as reported in the paper and consistent with my experimental results.

Thank you!
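The batch-size arithmetic above can be sanity-checked: the effective batch size is the number of GPUs times --max-tokens times --update-freq (gradient accumulation), so 8 GPUs with --update-freq 2 match the paper's 16-GPU setting. A small illustrative check, not fairseq code:

```python
# Effective token batch size in fairseq-style data-parallel training:
# n_gpus * max_tokens_per_gpu * update_freq (gradient accumulation steps).
def effective_batch_tokens(n_gpus, max_tokens, update_freq):
    return n_gpus * max_tokens * update_freq

# The paper's setting: 16 GPUs with 8192 tokens each, no accumulation.
paper = effective_batch_tokens(16, 8192, 1)
# Reproducing on 8 V100s requires --update-freq 2 to match it.
mine = effective_batch_tokens(8, 8192, 2)
assert paper == mine == 131072
```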

ftakanashi commented 5 years ago


Thank you very much. The reason I couldn't reproduce the result seems to be a problem in my data preprocessing: I had lowercased all my data, which introduced too many spurious representations into the corpus. When I use your data directly, it works! Thanks again!

yinhanliu commented 5 years ago


Thanks for your interest. Please use the code here

https://github.com/pytorch/fairseq/tree/master/examples/translation

with this command

python train.py your-data-bin \
  --arch transformer --share-all-embeddings \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr 5e-4 --warmup-init-lr 1e-7 --min-lr 1e-9 \
  --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --optimizer adam --adam-betas '(0.9, 0.98)' \
  --max-tokens 8192 --dropout 0.3 \
  --encoder-layers 6 --encoder-embed-dim 1024 \
  --decoder-layers 6 --decoder-embed-dim 1024 \
  --max-update 300000 --update-freq 2 --fp16 \
  --max-source-positions 10000 --max-target-positions 10000 \
  --save-dir checkpoints
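This command trains the autoregressive teacher; the distillation dataset is then built by decoding the training set with that teacher (e.g. with fairseq's generate.py) and using its translations as the new target side. A minimal sketch of the target-replacement step, with hypothetical toy data standing in for the real corpus files:

```python
# Sequence-level knowledge distillation: pair each source sentence with
# the teacher model's translation instead of the human reference.
def build_distillation_corpus(sources, teacher_outputs):
    # One teacher translation per source sentence, kept in order.
    assert len(sources) == len(teacher_outputs)
    return list(zip(sources, teacher_outputs))

# Hypothetical toy data standing in for train.en and the
# teacher-generated train.de.
sources = ["the cat sat .", "hello world ."]
teacher_outputs = ["die Katze sass .", "hallo Welt ."]
corpus = build_distillation_corpus(sources, teacher_outputs)
```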

ftakanashi commented 4 years ago


Hello, Liu! Thanks for sharing the test dataset with me last month. I am now also trying to train the model from scratch and have run into the same problem you did. I tried the command yinhanliu provided above to generate the distillation dataset, but the model still performs poorly after being trained on it. Did you manage to reproduce the paper's BLEU score after using the distillation data?

SunbowLiu commented 4 years ago


Hi,

I have successfully trained wmt-en-de from scratch. I used a distillation dataset produced by a powerful Transformer-big model (~29.3 BLEU score) (https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#pre-trained-models), which reproduces a final BLEU score >27.2. Note that Mask-Predict uses a batch size of 16*8192 tokens, so if you have only 8 V100s, you should set --update-freq to 2.

PanXiebit commented 4 years ago

Hi @SunbowLiu
Thank you for the information you have provided, but there isn't a de->en pretrained model at https://github.com/pytorch/fairseq/blob/master/examples/scaling_nmt/README.md#pre-trained-models.

Do you have any advice?

SunbowLiu commented 4 years ago


The only way might be training from scratch.

dmortem commented 3 years ago

Hi, when I used the checkpoint_best.pt provided in the README and the inference script "python generate_cmlm.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best.pt --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 --decoding-strategy mask_predict", I could only get a BLEU score of 20.90. What could be the problem? Are there any other hyperparameters I need to modify in the inference script?

I see "average the 5 best checkpoints to create the final model" in the paper. Is the checkpoint_best.pt provided in the link the final model? If not, how should the best checkpoints be averaged? Do we run 5 models forward and average their prediction distributions?

Thank you!
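On the averaging question: the usual fairseq practice is to average the parameters of the saved checkpoints into a single model (fairseq ships scripts/average_checkpoints.py for this), rather than ensembling the prediction distributions of 5 forward passes. A minimal sketch of parameter averaging over plain dicts, illustrative only and not the fairseq script itself:

```python
# Average corresponding parameters across several checkpoints.
# Real checkpoints are torch state_dicts of tensors; plain floats
# keep this sketch self-contained.
def average_checkpoints(state_dicts):
    keys = state_dicts[0].keys()
    return {
        k: sum(sd[k] for sd in state_dicts) / len(state_dicts)
        for k in keys
    }

# Two toy "checkpoints" with the same parameter names.
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
avg = average_checkpoints(ckpts)
assert avg == {"w": 2.0, "b": 1.0}
```

The averaged model is then used as a single network at inference time, which is much cheaper than running 5 models forward.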