THUDM / P-tuning

A novel method to tune language models. Code and datasets for the paper "GPT Understands, Too".
MIT License

Few-shot results drop a lot when the encoder is switched to bert-base-cased #12

Closed. Life-0-1 closed this issue 3 years ago

Life-0-1 commented 3 years ago

Hi, thank you very much for open-sourcing this code. While reproducing the results I ran into two questions and would appreciate your help:

  1. In the few-shot experiments, switching the encoder from albert-xxlarge-v2 to bert-base-cased while keeping everything else unchanged hurts performance a lot (accuracy only around 50% on WiC and RTE). Is this purely a matter of encoder capacity, or are there important hyperparameters that need retuning?
  2. When reproducing the paper's results with the released code, my results on CB differ a lot from the paper's, as shown in the attached image (mine on the left, the paper's on the right). What could be the cause?
zheng-yanan commented 3 years ago

Hi!

Thanks for your attention.

  1. In the few-shot experiments, both PET and P-tuning use albert-xxlarge-v2 to reach their respective best performance. Generally, performance is closely related to the FLOPs of the pretrained model. Since ALBERT enforces parameter sharing, it achieves better FLOPs and appears to be more efficient than BERT at any scale (see the sketch after this list).

  2. Thanks for pointing this out. I'm sorry to find that the CB script was wrong; I will update it as soon as possible.
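
A minimal sketch of the capacity gap behind point 1, written against the generic HuggingFace transformers API rather than this repo's own loading code (so treat the loading path as an assumption, not the repo's method):

```python
# Not part of this repo: compare the two encoders' size and hidden width.
from transformers import AutoConfig, AutoModel

for name in ["bert-base-cased", "albert-xxlarge-v2"]:
    config = AutoConfig.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    # albert-xxlarge-v2 shares weights across layers, so its parameter count
    # understates its per-token compute (hidden_size 4096 vs. 768 for BERT-base).
    print(f"{name}: hidden_size={config.hidden_size}, params={n_params / 1e6:.0f}M")
```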

The current script also records its results in inline comments, and the macro-F1 you report is a little lower than the commented one. In our experience, several factors can have a large influence on the final performance:

  1. Please use exactly the same environment versions as given.
  2. Our experiments show that in the few-shot setting, the number of GPUs matters a lot. For example, with batch_size = 16, the following settings lead to totally different results (see the sketch after this list):
     a. per_gpu_batch_size = 2, n_gpu = 8, accumulation_steps = 1
     b. per_gpu_batch_size = 8, n_gpu = 2, accumulation_steps = 1
     c. per_gpu_batch_size = 4, n_gpu = 2, accumulation_steps = 2
  3. Please keep the seed at its default value.
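
To make points 2 and 3 concrete, here is a minimal sketch (plain Python, not code from this repo; the variable names mirror the arguments quoted above, and the seed value 42 is only a placeholder for the released scripts' default):

```python
import random

import numpy as np
import torch

# Point 2: all three settings above multiply to the same effective batch of 16,
# but they split it differently across GPUs and gradient-accumulation steps,
# which can change data order and when optimizer steps happen.
def effective_batch_size(per_gpu_batch_size, n_gpu, accumulation_steps):
    return per_gpu_batch_size * n_gpu * accumulation_steps

assert effective_batch_size(2, 8, 1) == 16  # setting a
assert effective_batch_size(8, 2, 1) == 16  # setting b
assert effective_batch_size(4, 2, 2) == 16  # setting c

# Point 3: fix every RNG so runs are comparable; keep whatever default the
# released scripts use rather than this placeholder.
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```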

Please feel free to share with us if there are other problems. Thank you.

Riroaki commented 3 years ago

Hi, when will the CB training script be updated? My reproduced results also differ from those reported in the paper.