THUDM / P-tuning

A novel method to tune language models. Code and datasets for the paper "GPT Understands, Too".

Prompt Length in SuperGLUE #6

Closed Shamdan17 closed 3 years ago

Shamdan17 commented 3 years ago

Hello,

Thank you for your work and for releasing the code! I just have a few questions regarding the size of the prompt embeddings:

  • From the scripts you shared for the SuperGLUE tasks, the chosen pattern id is 1 for most tasks (except WSC, which uses 2). If I understood correctly, you discard the original notion of patterns and instead use the pattern id to denote the number of prompt embeddings to be trained. Does this mean you are using a single prompt embedding vector for most tasks?
  • If so, is there a specific reason why the LSTM performs better than the MLP in this case? If I understood correctly, one of the reasons the LSTM was used is to address the association problem and make the different prompt embeddings dependent on one another. Would this problem even arise with just one prompt embedding?

Thank you for your work and cooperation!

zheng-yanan commented 3 years ago

  1. Yes, the pattern_id is used to denote the number of prompt embeddings. For the few-shot SuperGLUE tasks, most tasks use a single prompt embedding.
  2. According to our experimental results, in the few-shot setting with a single prompt embedding, the LSTM and the MLP yield similar results with only subtle differences. The association problem is more noticeable in knowledge probing, which uses 6-9 prompt embeddings.
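
To make this concrete, here is a minimal PyTorch sketch of the idea discussed above (a simplified illustration, not the code in this repository; `PromptEncoder`, `num_prompt_tokens`, and `encoder_type` are hypothetical names): the pattern id sets the number of trainable prompt embeddings, which are reparameterized through either an LSTM or an MLP before being spliced into the model's input embeddings.

```python
import torch
import torch.nn as nn


class PromptEncoder(nn.Module):
    """Hypothetical sketch of a P-tuning-style prompt encoder (names made up).

    num_prompt_tokens plays the role of the pattern id in this thread:
    1 for most SuperGLUE tasks, 2 for WSC, 6-9 for knowledge probing.
    """

    def __init__(self, num_prompt_tokens: int, hidden_size: int,
                 encoder_type: str = "lstm"):
        super().__init__()
        self.register_buffer("prompt_ids", torch.arange(num_prompt_tokens))
        # Raw trainable prompt embeddings, one vector per prompt token.
        self.embedding = nn.Embedding(num_prompt_tokens, hidden_size)
        if encoder_type == "lstm":
            # A bidirectional LSTM lets the prompt vectors depend on each
            # other (the "association" discussed above); with a single
            # prompt token there is no sequence to model, so an MLP does
            # about as well.
            self.lstm = nn.LSTM(hidden_size, hidden_size // 2, num_layers=2,
                                bidirectional=True, batch_first=True)
        elif encoder_type == "mlp":
            self.lstm = None
        else:
            raise ValueError(f"unknown encoder_type: {encoder_type}")
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self) -> torch.Tensor:
        # (1, num_prompt_tokens, hidden_size)
        x = self.embedding(self.prompt_ids).unsqueeze(0)
        if self.lstm is not None:
            x, _ = self.lstm(x)
        x = self.mlp(x)
        # (num_prompt_tokens, hidden_size), spliced into the input embeddings
        return x.squeeze(0)


# Single prompt embedding (pattern id 1), as in most of the SuperGLUE scripts:
encoder = PromptEncoder(num_prompt_tokens=1, hidden_size=768, encoder_type="lstm")
prompt = encoder()  # shape: (1, 768)
```

With `num_prompt_tokens=1` the LSTM and MLP variants are nearly equivalent, which matches the observation above that the two perform similarly in the few-shot setting.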

Thank you.