THUDM / P-tuning

A novel method to tune language models. Code and datasets for the paper "GPT Understands, Too".

Few-shot NLU: learning rate for model parameters vs. embedding parameters #7

Closed · nelson-liu closed this 3 years ago

nelson-liu commented 3 years ago

Hi!

Thanks for the interesting paper and for releasing this nice codebase! I have a quick question about the learning rate used for the few-shot NLU experiments. The paper mentions (Section 4.2) that:

> We perform grid search of hyper-parameters and take the best combination on D_dev or D_dev32. Specifically, we take learning rates from 1e-5, 2e-5, 3e-5 and batch sizes from 16, 32

However, it seems like the model is updated with a fixed learning rate of 1e-5 in the code (https://github.com/THUDM/P-tuning/blob/main/PT-Fewshot/pet/wrapper.py#L312), and the learning rate taken from the CLI is only used for the embedding parameters.
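Concretely, the split looks something like the following (a minimal sketch with illustrative names, assuming a standard PyTorch AdamW setup; this is not the repo's exact code):

```python
from torch.optim import AdamW

def build_optimizers(model, prompt_encoder, cli_learning_rate):
    # Backbone parameters: fixed LR of 1e-5, as hard-coded in wrapper.py.
    backbone_opt = AdamW(model.parameters(), lr=1e-5)
    # Prompt-embedding parameters: LR taken from the CLI argument.
    embedding_opt = AdamW(prompt_encoder.parameters(), lr=cli_learning_rate)
    return backbone_opt, embedding_opt
```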

Given that the paper and code seem to differ in this regard, I'm not sure if this is a bug in the code (i.e., the model and the embedding parameters should always use the LR taken from the CLI) or if the paper omits this detail (i.e., in reality, the LR grid search is only done on the embedding parameters, and 1e-5 is always used for the model). Could you clarify which approach was taken in your experiments?

Thanks again!

nelson-liu commented 3 years ago

Ah, rereading that passage, am I correct that the grid search is not used in the few-shot setting (and that the default hyperparameters from PET are used)?

zheng-yanan commented 3 years ago

> Ah, rereading that passage, am I correct that the grid search is not used in the few-shot setting (and that the default hyperparameters from PET are used)?

Hi!

Yes, in the few-shot setting we use the hyperparameters from PET and additionally select the prompt-related hyperparameters. We also experimented with using the same versus different learning rates for the backbone and the prompt embeddings, and found that using different learning rates yields better performance in the few-shot setting. The grid search mentioned in the paper was used in the fully supervised setting.
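For anyone reading along, here is a minimal sketch of that technique (illustrative names and placeholder learning rates, not the settings used in the paper): put the backbone and the prompt embeddings in separate parameter groups so each gets its own learning rate.

```python
from torch.optim import AdamW

def build_optimizer(backbone, prompt_encoder, backbone_lr=1e-5, prompt_lr=1e-4):
    """Single AdamW with one parameter group per module; the default
    LR values here are placeholders, not tuned settings."""
    return AdamW([
        {"params": backbone.parameters(), "lr": backbone_lr},
        {"params": prompt_encoder.parameters(), "lr": prompt_lr},
    ])
```

Whether you use one optimizer with two parameter groups (as above) or two separate optimizers, the effect is the same: the backbone and the prompt embeddings are updated with different learning rates.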

Thank you.