THUDM / P-tuning-v2

An optimized deep prompt tuning strategy comparable to fine-tuning across scales and tasks
Apache License 2.0

DeBERTa P-Tuning v2 speed #36

Closed kefirski closed 2 years ago

kefirski commented 2 years ago

I've observed that DeBERTaV2 trained with P-Tuning v2 takes significantly more time to evaluate than with other methods. Have you observed such behaviour?

It even takes significantly more time than P-Tuning v1, despite the fact that v1 has higher complexity for evaluating attention.

It seems like the issue is the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't figure out the specific reason for it.

kefirski commented 2 years ago

The same holds for DeBERTa V1. I have also noticed that GPU utilization is dramatically lower for P-Tuning v2 compared to P-Tuning v1.

dyh1998 commented 2 years ago

I have a question: which is faster, P-Tuning v2 or fine-tuning? To my understanding, P-Tuning v2 should be slower than P-Tuning, because it updates more parameters than P-Tuning.

kefirski commented 2 years ago

While P-Tuning v1 has fewer parameters to update, evaluating the attention mechanism in this scheme requires O((n + d)^2) operations, since the sequence length is naively extended by a prompt of length d.

For P-Tuning v2, the complexity is O(n(n + d)), since only the tokens of the original sequence attend to the prefixes, which are added to the computation via past_key_values. Furthermore, you don't have to evaluate the remaining Transformer sublayers (e.g., the position-wise feed-forward mappings) on the prefixes as in P-Tuning v1.
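To make the shapes concrete, here is a toy sketch in plain PyTorch (not code from this repository; n, d, and the head dimension are made-up values) showing where the prompt length d enters the attention score matrix in each scheme:

```python
import torch

n, d, h = 128, 100, 64   # sequence length, prompt/prefix length, head dim (illustrative values)

x = torch.randn(1, n, h)        # hidden states of the original tokens
prompt = torch.randn(1, d, h)   # trainable prompt / prefix states

# P-Tuning v1: prompt tokens are prepended to the input, so they appear in
# both queries and keys -> scores are (n + d) x (n + d), i.e. O((n + d)^2),
# and the prefix positions also pass through every feed-forward sublayer.
qk_v1 = torch.cat([prompt, x], dim=1)
scores_v1 = qk_v1 @ qk_v1.transpose(-1, -2)     # shape: (1, n + d, n + d)

# P-Tuning v2: the prefix is injected only into keys/values (past_key_values),
# queries remain the original n tokens -> scores are n x (n + d), i.e. O(n(n + d)).
keys_v2 = torch.cat([prompt, x], dim=1)
scores_v2 = x @ keys_v2.transpose(-1, -2)       # shape: (1, n, n + d)

print(scores_v1.shape, scores_v2.shape)
```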

kefirski commented 2 years ago

I observed that evaluating P-Tuning v2 with a prefix of length 100 is about twice as fast as P-Tuning v1 for RoBERTa. For DeBERTa, however, P-Tuning v2 is about 40 times slower, which does not seem to be a legitimate result.
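In case it helps with reproducing such numbers, here is a generic way to time a forward pass (a sketch, not code from this repository; `forward_fn` stands for any zero-argument callable wrapping the model call):

```python
import time
import torch

@torch.no_grad()
def time_forward(forward_fn, n_warmup=5, n_iters=20):
    """Rough average wall-clock time of a forward pass."""
    for _ in range(n_warmup):          # warm up kernels and caches first
        forward_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()       # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(n_iters):
        forward_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters
```

One would wrap the v1-style and v2-style evaluation calls in such callables and compare the returned averages.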

dyh1998 commented 2 years ago

Alright, thanks for your patient explanation; it made me realize I'm lacking some of the model's architectural details. I don't know much about DeBERTa, either. Maybe somebody else can help you.

Best,

Xiao9905 commented 2 years ago

@kefirski Hi,

Thanks for your interest in our work! Sorry for the late reply.

It seems like the issue is the ad-hoc implementation of past_key_values for DeBERTa, which is the only difference in the backbone model code between v1 and v2, but I can't figure out the specific reason for it.

Yes, I think this is the reason. At the time we were experimenting with P-Tuning v2, DeBERTa had not yet been officially implemented in huggingface transformers, so we implemented the past_key_values functions ourselves. We are sorry that our own implementation can be slower than the official huggingface one.
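For context, a hand-rolled past_key_values path inside self-attention typically amounts to something like the following sketch (plain PyTorch, illustrative names; this is not the repository's DeBERTa code, which additionally has to account for DeBERTa's disentangled relative-position attention):

```python
import torch
import torch.nn.functional as F

def self_attention_with_prefix(q, k, v, past_key_value=None):
    """Scaled dot-product attention where an optional (key, value) prefix,
    e.g. the P-Tuning v2 prompt, is concatenated in front of the keys/values.
    q, k, v: (batch, heads, seq_len, head_dim)."""
    if past_key_value is not None:
        past_k, past_v = past_key_value          # each (batch, heads, prefix_len, head_dim)
        k = torch.cat([past_k, k], dim=2)        # keys cover prefix_len + seq_len positions
        v = torch.cat([past_v, v], dim=2)
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)   # (batch, heads, seq_len, prefix_len + seq_len)
    return F.softmax(scores, dim=-1) @ v                     # (batch, heads, seq_len, head_dim)
```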