THUDM / P-tuning

A novel method to tune language models. Code and datasets for the paper "GPT Understands, Too".
MIT License

Inconsistent SuperGLUE Results from P-Tuning and P-TuningV2 Paper #33

Closed. theoqian closed this issue 2 years ago.

theoqian commented 2 years ago

Hi, I find that most of the SuperGLUE metrics for PT reported in the P-Tuning paper are superior to those of fine-tuning, but the metrics for PT reported in the P-Tuning v2 paper are much worse than fine-tuning. For example, on the BoolQ task, the P-Tuning paper reports an accuracy of 72.9 for fine-tuning and 73.9 for PT, while the P-Tuning v2 paper reports 77.7 for fine-tuning and 67.2 for PT.

It seems that PT as reported in the P-Tuning v2 paper is much worse than fine-tuning, which is the opposite of the conclusion drawn in the P-Tuning paper.

Xiao9905 commented 2 years ago

@theoqian Hi,

I think this issue in the P-tuning v2 repo raises the same question as yours. The difference arises because, in P-tuning v2, we report a fixed-backbone version of P-tuning v1 in order to follow the experimental setting of Lester et al.
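For reference, here is a minimal sketch of what that fixed-backbone setting means in practice (this is not the repo's actual code; the model name, prompt length, and optimizer choice are assumptions for illustration): the pretrained backbone is frozen and only the continuous prompt parameters are trained.

```python
# Sketch of the "fixed backbone" prompt-tuning setting (Lester et al. style).
# Assumptions: bert-base-uncased backbone, 20 prompt tokens, AdamW optimizer.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Freeze every backbone parameter so fine-tuning cannot happen.
for param in model.parameters():
    param.requires_grad = False

# Hypothetical trainable prompt embeddings that would be prepended to the input.
prompt_length = 20
hidden_size = model.config.hidden_size
prompt_embeddings = torch.nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.02)

# Only the prompt parameters are handed to the optimizer.
optimizer = torch.optim.AdamW([prompt_embeddings], lr=1e-3)
```

In the full fine-tuning baseline, by contrast, all backbone parameters would remain trainable, which is why the two papers report such different gaps between PT and fine-tuning.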

theoqian commented 2 years ago

Thanks for your reply.