THUDM / P-tuning

A novel method to tune language models. Codes and datasets for the paper "GPT understands, too".

gpt2-medium LAMA #17

Open casually-PYlearner opened 3 years ago

casually-PYlearner commented 3 years ago

Hi, I have just p-tuned gpt2-medium on the LAMA task with the default params, and the results are as follows: best dev_hit@1: 51.8, best test_hit@1: 44.5.

I have a couple of questions about these results:

(1) There seems to be a gap between the dev and test results. Are the dev set and the test set drawn from the same distribution? Would it be possible to provide the scripts for generating the train/dev/test splits, together with the original dataset?

(2) The result reported in the paper is 46.5, which is close to my best test_hit@1. Is the number in the paper measured on the test set? It would be very helpful if the shell scripts for reproducing the paper's results were provided.
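For reference, the hit@1 numbers above are just top-1 accuracy on the masked object token. A minimal sketch of how I understand the metric (function and variable names are illustrative, not taken from the repo's evaluation code):

```python
import torch

def hit_at_1(logits: torch.Tensor, gold_ids: torch.Tensor) -> float:
    """Top-1 accuracy over the vocabulary at the [MASK] position.

    logits:   (batch, vocab_size) scores for the masked position
    gold_ids: (batch,) gold object token ids
    Returns a percentage, matching how results are quoted in this thread.
    """
    preds = logits.argmax(dim=-1)                      # top-1 token id per example
    return (preds == gold_ids).float().mean().item() * 100.0
```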

zhaochen0110 commented 3 years ago

Hi, I also p-tuned on the LAMA task with the default params and ran into the same issue, in my case with bert-base-uncased. My best dev_hit@1 is 75.1 and best test_hit@1 is 85.2, whereas the result reported in the paper is 52.3. Did you run into the same problem? Has yours been resolved?

lancorrect commented 1 year ago

> Hi, I also p-tuned on the LAMA task with the default params and ran into the same issue, in my case with bert-base-uncased. My best dev_hit@1 is 75.1 and best test_hit@1 is 85.2, whereas the result reported in the paper is 52.3. Did you run into the same problem? Has yours been resolved?

Hi,

The discrepancy you mentioned may come from running the code on a single sub-dataset (a single relation such as P1001). I suspect the author ran the code on the whole dataset and averaged the results over all relations. Maybe you can give that a try and check whether I'm right; a rough sketch of that kind of aggregation is below.
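In case it helps, here is a minimal sketch of averaging per-relation results, assuming each relation run dumps its metrics to a JSON file (the results/<relation>/metrics.json layout and the "test_hit@1" key are my assumptions, not the repo's actual output format):

```python
import glob
import json

# Collect the per-relation test hit@1 scores; one metrics.json per relation run is assumed.
scores = []
for path in sorted(glob.glob("results/*/metrics.json")):
    with open(path) as f:
        scores.append(json.load(f)["test_hit@1"])

# Macro-average over relations, which is how I guess the paper's single number was obtained.
if scores:
    print(f"macro-averaged test hit@1 over {len(scores)} relations: {sum(scores) / len(scores):.1f}")
else:
    print("no per-relation results found under results/")
```

A macro average like this weights every relation equally regardless of how many triples it contains; if the paper instead pooled all triples before computing hit@1, the number could differ somewhat.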