THUDM / P-tuning-v2

An optimized deep prompt tuning strategy comparable to fine-tuning across scales and tasks
Apache License 2.0

Difference between P-tuning and -v2 in the codes #40

Closed eunjiinkim closed 2 years ago

eunjiinkim commented 2 years ago

Hi, thanks for your work :)

I've read your paper and tried to understand P-tuning-v2 from the implementation code so that I can apply it to GPT2. (I've actually already written the code, but I'm not sure I did it correctly.) My understanding is that P-tuning-v2 works through past_key_values, which carries the output of the prefix encoder.
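For reference, here is a minimal sketch of how I read the prefix-encoder-to-past_key_values path for GPT2. The names (PrefixEncoder, pre_seq_len) and the exact reshape order are my own assumptions rather than necessarily the repo's exact code:

```python
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    # Learns a prefix of length pre_seq_len and projects it into per-layer
    # key/value tensors in GPT-2's legacy past_key_values layout:
    # one (key, value) pair per layer, each (batch, n_head, pre_seq_len, head_dim).
    def __init__(self, pre_seq_len, n_layer, n_head, head_dim):
        super().__init__()
        hidden_size = n_head * head_dim
        self.pre_seq_len = pre_seq_len
        self.n_layer = n_layer
        self.n_head = n_head
        self.head_dim = head_dim
        # One embedding row per prefix position, projected to key + value
        # vectors for each of the n_layer transformer blocks.
        self.embedding = nn.Embedding(pre_seq_len, n_layer * 2 * hidden_size)

    def forward(self, batch_size, device):
        prefix_ids = torch.arange(self.pre_seq_len, device=device)
        prefix = self.embedding(prefix_ids)                       # (P, L*2*H)
        prefix = prefix.unsqueeze(0).expand(batch_size, -1, -1)   # (B, P, L*2*H)
        prefix = prefix.reshape(batch_size, self.pre_seq_len,
                                self.n_layer * 2, self.n_head, self.head_dim)
        prefix = prefix.permute(2, 0, 3, 1, 4)                    # (L*2, B, nH, P, dH)
        # Split into one (key, value) pair per layer.
        return tuple((kv[0], kv[1]) for kv in prefix.split(2))
```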

From the code, it seems the main difference between P-tuning and v2 also shows up in the input and output shapes. In v1, the prompt embeddings are concatenated with the embeddings of input_ids and fed directly to the model, so the output logits include the prompt positions, which are simply excluded from the loss. In v2, on the other hand, the prefix encoder's output is injected through past_key_values: the model receives the original input_ids together with past_key_values, so its output logits do not include the first prompt-length positions (roughly as sketched below).
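To make that contrast concrete, this is roughly what I mean for GPT2, with random tensors standing in for the learned prompt/prefix, and assuming a Transformers version that still accepts the legacy tuple format for past_key_values:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
cfg = model.config
pre_seq_len = 16                                   # illustrative prefix length
input_ids = torch.tensor([[50256, 318, 257]])      # arbitrary token ids

# --- v1 style: concatenate prompt embeddings with the token embeddings ---
prompt_embeds = torch.randn(1, pre_seq_len, cfg.n_embd)   # stand-in for a learned prompt
token_embeds = model.transformer.wte(input_ids)
inputs_embeds = torch.cat([prompt_embeds, token_embeds], dim=1)
out_v1 = model(inputs_embeds=inputs_embeds)
# Logits cover prompt + tokens, so the first pre_seq_len positions
# must be dropped (or their labels set to -100) before the loss.
assert out_v1.logits.shape[1] == pre_seq_len + input_ids.shape[1]

# --- v2 style: inject the prefix through past_key_values ---
head_dim = cfg.n_embd // cfg.n_head
past_key_values = tuple(
    (torch.randn(1, cfg.n_head, pre_seq_len, head_dim),   # stand-in for prefix encoder output
     torch.randn(1, cfg.n_head, pre_seq_len, head_dim))
    for _ in range(cfg.n_layer)
)
prefix_mask = torch.ones(1, pre_seq_len, dtype=torch.long)
attention_mask = torch.cat([prefix_mask, torch.ones_like(input_ids)], dim=1)
out_v2 = model(input_ids=input_ids,
               past_key_values=past_key_values,
               attention_mask=attention_mask)
# Logits only cover the real tokens; no prompt positions to strip.
assert out_v2.logits.shape[1] == input_ids.shape[1]
```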

I'd appreciate it if you could check whether I understood this correctly. Thanks!