Open Harry-zzh opened 2 years ago
I have the same reproduction problem and cannot get results as good as those published in the paper. We would really appreciate it if you could share the experiment settings and clean code. Thanks a lot!
Hi @Harry-zzh, thanks for your interest in our work! Just to confirm: by 3% lower than reported, you mean 3% lower in absolute terms, right? If so, that would be lower than all baselines in Table 1, even vanilla KD.
@MichaelZhouwang and I will take a closer look and it'll be great if you can share the exact command used with us.
Hi @Harry-zzh First, could you please share which dataset you ran your experiments on? If it is one of the small datasets, a 3% variation may indeed come from different random seeds. Otherwise, can you share the exact command for your best result on the task?
Also, you may check the following points:
Thank you for your reply. @JetRunner @MichaelZhouwang
Model | MRPC (F1/Acc.) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spear.) | MNLI-m/mm (Acc.) | QNLI (Acc.) | QQP (F1/Acc.) |
---|---|---|---|---|---|---|---|
Vanilla KD (mine) | 86.2/80.3 | 64.7 | 91.7 | 83.4/81.9 | 80.4/79.8 | 87.5 | 69.7/88.6 |
Vanilla KD [1] | 86.2/80.6 | 64.7 | 91.5 | / | 80.2/79.8 | 88.3 | 70.1/88.8 |
Model | MRPC (F1/Acc.) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spear.) | MNLI-m/mm (Acc.) | QNLI (Acc.) | QQP (F1/Acc.) |
---|---|---|---|---|---|---|---|
Meta Distill (mine) | 85.2/79.5 | 65.6 | 91.4 | 83.1/81.4 | 80.8/80.0 | 87.4 | 70.1/88.5 |
Meta Distill [2] | 88.7/84.7 | 67.2 | 93.5 | 86.1/85.0 | 83.8/83.2 | 90.2 | 71.1/88.9 |
As you can see, almost all of my results on the test set are about 3% lower than your reported results. I can reproduce the vanilla KD results listed in [1], but your reported numbers are significantly higher than theirs, and those I cannot reproduce.
Model | MRPC (F1/Acc.) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spear.) | MNLI-m/mm (Acc.) | QNLI (Acc.) | QQP (F1/Acc.) |
---|---|---|---|---|---|---|---|
BERT-Base (mine) | 89.0/85.2 | 69.5 | 93.2 | 87.2/85.9 | 84.3/83.9 | 91.1 | 71.5/89.2 |
BERT-Base [2] | 88.9/84.8 | 66.4 | 93.5 | 87.1/85.8 | 84.6/83.4 | 90.5 | 71.2/89.2 |
And my teacher achieves even better performance than your reported results.
References:
[1] Sun S., Cheng Y., Gan Z., et al. Patient Knowledge Distillation for BERT Model Compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019: 4323-4332.
[2] Zhou W., Xu C., McAuley J. BERT Learns to Teach: Knowledge Distillation with Meta Learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022: 7037-7049.
@Harry-zzh Thanks for the info. Is it on test set (i.e., GLUE server) or validation set? If it's on test set, could you please also provide the results on the development set?
@MichaelZhouwang could you give it a look?
By the way, in the NLP experiments, the students in our implementations of both vanilla KD and our approach are initialized with pretrained BERT (the well-read student) rather than with the fine-tuned teacher. That is probably why the vanilla KD numbers we report are significantly higher (see the caption under Table 1).
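To make the initialization difference concrete, here is a toy sketch of what "initialize the 6-layer student from pretrained BERT" could look like, written by me for illustration over a plain dict of named weights (the actual repo operates on real checkpoints and tensors; the naming scheme below just mimics the common `encoder.layer.N` convention):

```python
def init_student_from_pretrained(pretrained, num_student_layers=6):
    """Build a student state dict from a deeper pretrained checkpoint.

    Copies embeddings, pooler, and other non-layer weights as-is, and
    keeps only the first `num_student_layers` encoder layers.
    """
    student = {}
    for name, weight in pretrained.items():
        if name.startswith("encoder.layer."):
            # Names look like "encoder.layer.<idx>.<rest>".
            layer_idx = int(name.split(".")[2])
            if layer_idx < num_student_layers:
                student[name] = weight
        else:
            # Embeddings, pooler, classifier head, etc.
            student[name] = weight
    return student
```

The point of the discussion above is only *which* checkpoint this runs on: the pretrained (well-read) model rather than the fine-tuned teacher.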
@Harry-zzh Can you share the exact command for your best result on the task? Also, can you share the results on the dev set of the GLUE benchmark? You can first focus on reproducing the results on the dev set.
Thanks for your reply. I have shown the results on the test set before, and the results on the dev set are as follows:
Model | MRPC (F1/Acc.) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spear.) | MNLI (Acc.) | QNLI (Acc.) | QQP (F1/Acc.) |
---|---|---|---|---|---|---|---|
Vanilla KD (mine) | 89.6/84.8 | 68.6 | 91.7 | 88.6/88.5 | 80.9 | 87.7 | 86.6/90.1 |
Meta Distill (mine) | 89.4/84.3 | 69.3 | 91.3 | 88.3/88.0 | 81.3 | 87.9 | 87.2/90.4 |
BERT-Base (mine) | 91.6/88.2 | 73.3 | 93.1 | 89.8/89.4 | 85.1 | 91.6 | 88.0/91.1 |
I tried a grid search over the hyper-parameter sets I described before, and I chose the best checkpoint on the dev set to make predictions on the test set. An example of my command on the MNLI dataset is:
```shell
python nlp/run_glue_distillation_meta.py \
  --model_type bert \
  --teacher_model nlp/bert-base-finetuned/mnli \
  --student_model nlp/bert-base-finetuned/mnli \
  --num_hidden_layers 6 \
  --task_name MNLI --do_train --do_eval --do_lower_case \
  --data_dir nlp/glue_data/MNLI \
  --alpha 0.5 --beta 0 --temperature 5 \
  --assume_s_step_size 2e-05 \
  --per_gpu_train_batch_size 32 --per_gpu_eval_batch_size 32 \
  --learning_rate 2e-05 --teacher_learning_rate 2e-06 \
  --max_seq_length 128 --num_train_epochs 5 \
  --warmup_steps 200 --gradient_accumulation_steps 2 \
  --seed 42 --logging_rounds 1000 --save_steps 1000 \
  --output_dir output/mnli
```
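As a side note on what `--alpha` and `--temperature` control: below is a minimal NumPy sketch of the standard KD objective, written by me for illustration. The repo's exact convention (e.g., which term `alpha` weights) may differ, so treat this only as a reference for how temperature and the hard/soft mixing enter the loss:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T gives a softer distribution.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=5.0):
    """alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(teacher || student).

    The T^2 factor (Hinton et al.) keeps gradient magnitudes comparable
    across temperatures.
    """
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    soft = (T ** 2) * np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()
    # Hard-label cross-entropy is computed at T = 1.
    p_hard = softmax(student_logits, 1.0)
    hard = -np.log(p_hard[np.arange(len(labels)), labels]).mean()
    return alpha * hard + (1 - alpha) * soft
```

With this convention, `--alpha 0.5 --temperature 5` would weight the hard-label and distillation terms equally at temperature 5.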
Also, @JetRunner said the students in your approach are initialized with pretrained BERT (the well-read student), while @MichaelZhouwang said they are initialized with the fine-tuned teacher. I am a bit confused.
Looking forward to your reply. I would be grateful if you could provide the exact experiment settings for each dataset.
Hi, that was my mistake. The student is initialized from pretrained BERT (the well-read student). But initializing from the fine-tuned teacher should achieve similar performance.
First, I think you should change `--num_held_batches` from 0 to something like 1/2/4, which introduces randomness into teacher training and speeds up training. Also, for MNLI I think you should use more warmup steps (e.g., 1000/1500/2000), a larger alpha (0.6/0.7), a lower temperature (2/3), a smaller `logging_rounds` (e.g., 200), and a larger teacher learning rate (e.g., 5e-6). You may also need to add some regularization such as `weight_decay`. You should be able to reach above 83.5 on MNLI without much difficulty.
For smaller datasets such as MRPC, an effective batch size of 64 (per-GPU batch size 32 × gradient accumulation 2) is certainly too large. Also, you should carefully tune the warmup steps together with `num_train_epochs`.
Unfortunately, we cannot offer you the exact experiment settings for each dataset because we no longer have them. Nevertheless, the notes above are the tips we can offer.
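Putting these tips together with the original MNLI command, an adjusted run might look like the following. This is only a sketch: the values shown are one point in the suggested grid, the `weight_decay` value is my own guess, and per the discussion above `--student_model` may need to point at a pretrained (not fine-tuned) BERT checkpoint:

```shell
python nlp/run_glue_distillation_meta.py \
  --model_type bert \
  --teacher_model nlp/bert-base-finetuned/mnli \
  --student_model nlp/bert-base-finetuned/mnli \
  --num_hidden_layers 6 \
  --task_name MNLI --do_train --do_eval --do_lower_case \
  --data_dir nlp/glue_data/MNLI \
  --alpha 0.7 --beta 0 --temperature 2 \
  --assume_s_step_size 2e-05 \
  --per_gpu_train_batch_size 32 --per_gpu_eval_batch_size 32 \
  --learning_rate 2e-05 --teacher_learning_rate 5e-06 \
  --max_seq_length 128 --num_train_epochs 5 \
  --warmup_steps 1500 --gradient_accumulation_steps 2 \
  --num_held_batches 2 --weight_decay 0.01 \
  --seed 42 --logging_rounds 200 --save_steps 1000 \
  --output_dir output/mnli
```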
For further questions, you can send me an email with your WeChat ID at wcszhou@outlook.com so that I can offer further guidance and help more promptly and conveniently.
Thanks, I will give it a try.
@Harry-zzh Hi, were you able to reproduce the results reported in the paper?
@Harry-zzh @Hakeyi Hi! Were you able to reproduce the results? If yes, is it possible to share your findings? Thanks a lot!
Sorry for the late reply. I failed to reproduce the results.
Thanks for your excellent work. I tried a grid search over the settings described in your paper and code, but it is still hard for me to reproduce the GLUE benchmark results. My results on both the dev set and the test set are about 3% lower than yours. I would be very grateful if you could provide the exact experiment settings for each dataset, or code that reproduces the GLUE benchmark results. Looking forward to your reply, thank you!