Prefix-Tuning: Optimizing Continuous Prompts for Generation
Hi Lisa~ I rewrote the code for BART, referring to yours, on top of the latest Hugging Face transformers, and I want to verify one thing: in my training runs, the speed of prefix-tuning is about 60%~70% of that of full-parameter fine-tuning, even when I use a very small prefix prompt module. Does that make sense? And where might the speed bottleneck be? Hope for your reply.

I think 60%-70% makes sense!

Great question: the speed gain in prefix-tuning comes from not having to update as many parameters tracked by the optimizer (i.e., fewer trainable parameters), but backprop is still required all the way down to the bottom Transformer layer. A thought experiment that could explain this: imagine you train only the last layer of a Transformer. Then both the number of trainable parameters and the number of layers you need to backprop through are reduced (you only backprop through one layer, since you don't need the gradients of the earlier layers). However, if you train only the first layer of the Transformer, you need to backprop all the way down, despite having the same number of trainable parameters.

Based on the first-layer vs. last-layer analogy, let's go back to prefix-tuning. We tune activations at every layer, and therefore we need to backprop all the way back to the first layer, so backprop time is not reduced. The only computation saved is that we don't need to do as many parameter updates.
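
To make the first-layer vs. last-layer argument concrete, here is a minimal sketch in plain Python. It is only a toy cost model, not code from the prefix-tuning repository: the function name `backward_depth`, the layer indexing, and the 12-layer setup are all assumptions made for illustration. It counts how many layers backprop has to traverse, given which layers contain something trainable.

```python
def backward_depth(num_layers, trainable_layers):
    """Toy model: number of layers backprop must traverse.

    Layers are indexed 0 (bottom) to num_layers - 1 (top). Gradients flow
    from the loss at the top down to the lowest layer that holds any
    trainable parameters; below that, no gradients are needed.
    """
    if not trainable_layers:
        return 0  # nothing to train, so no backward pass is needed
    lowest = min(trainable_layers)
    return num_layers - lowest


num_layers = 12  # hypothetical depth, chosen only for the example

# Train only the last layer: backprop stops after one layer.
last_only = backward_depth(num_layers, {num_layers - 1})

# Train only the first layer: same parameter count, but full backprop.
first_only = backward_depth(num_layers, {0})

# Prefix-tuning touches every layer, so it also needs full backprop.
prefix_tuning = backward_depth(num_layers, set(range(num_layers)))

# Full fine-tuning for comparison.
full_finetune = backward_depth(num_layers, set(range(num_layers)))

print(last_only, first_only, prefix_tuning, full_finetune)  # 1 12 12 12
```

Under this simplified view, prefix-tuning and full fine-tuning traverse the same number of layers in the backward pass, so the remaining savings can only come from the smaller optimizer update.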

Let me know if this makes sense.

Many thanks for your analysis! I had assumed it was for the same reason too, haha. Thanks again!

Of course, it depends on the model. I think 11 GB of memory is enough for the E2E dataset with GPT-2.

