THUDM / P-tuning

A novel method to tune language models. Code and datasets for the paper "GPT Understands, Too".
MIT License

questions about discreteness in optimization #39

Closed · skygl closed this 2 years ago

skygl commented 2 years ago

Hi! Thank you for the interesting paper. While reading it, I came across something I don't fully understand and would like to ask about.

In Section 3.2 (Optimization) of the paper:

> If h is initialized with random distribution and then optimized with stochastic gradient descent (SGD), which has been proved to only change the parameters in a small neighborhood (Allen-Zhu et al., 2019), the optimizer would easily fall into local minima.

Is the problem related to the discreteness of the word embeddings $e$ during optimization? Could you explain this in more detail?

My second question is why the proposed prompt encoder encourages discreteness. Maybe it is connected to the first question.

Thank you.

Xiao9905 commented 2 years ago

@skygl Hi,

Thanks for your interest in P-tuning!

The discreteness here refers to how the embeddings are laid out in the embedding space. Embeddings are vectors in a continuous space, and gradient-based optimization may suffer from the issue that it would

> only change the parameters in a small neighborhood

At that time we attributed some of the poor results we had encountered to this optimization challenge (i.e., the prompt embeddings were only changed locally and were not optimized enough). The prompt encoder serves as a reparameterization (also used in prefix-tuning (Li & Liang, 2021)); it contains nonlinear activation functions that encourage the output embeddings to be optimized toward farther locations in the embedding space.
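For illustration, here is a minimal sketch of what such a reparameterization might look like in PyTorch. It is not the repository's exact implementation; the class name, hidden size, and LSTM + MLP architecture are assumptions in the spirit of P-tuning's prompt encoder.

```python
import torch
import torch.nn as nn


class PromptEncoder(nn.Module):
    """Sketch of a prompt encoder: reparameterizes trainable prompt vectors
    through an LSTM and an MLP with a nonlinear activation before they are
    prepended to the frozen language model's input embeddings."""

    def __init__(self, num_prompt_tokens: int, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Free parameters h that would otherwise be fed to the LM directly.
        self.raw_prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim))
        # The nonlinearity lets gradient updates move the *output* embeddings
        # farther than updating raw_prompt alone would.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self) -> torch.Tensor:
        # Returns (1, num_prompt_tokens, embed_dim): the prompt embeddings
        # to concatenate with the frozen LM's token embeddings.
        out, _ = self.lstm(self.raw_prompt.unsqueeze(0))
        return self.mlp(out)
```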

However, in recent months we have gradually come to understand that, even without reparameterization, the embeddings are still changed drastically by gradients. So our hypothesized explanation at the time, that prompt embeddings without a prompt encoder cannot be optimized out of their initial neighborhood, does not really hold. The optimization challenge still exists, but we now think it lies somewhere else in the training of P-tuning.
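One way to check this observation empirically is to track how far the raw prompt embeddings move from their initialization during training. The snippet below is a hypothetical sketch; the variable and attribute names are illustrative and not taken from this repository.

```python
import torch


def embedding_drift(initial: torch.Tensor, current: torch.Tensor) -> torch.Tensor:
    """Mean L2 distance each prompt embedding has moved from its initialization.

    A large drift even without a prompt encoder would support the observation
    that gradients do move the embeddings far from their starting neighborhood.
    """
    return (current - initial).norm(dim=-1).mean()


# Hypothetical usage inside a training loop:
# init_prompt = model.raw_prompt.detach().clone()
# ... train ...
# print(f"drift: {embedding_drift(init_prompt, model.raw_prompt.detach()):.4f}")
```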

I am not sure if my answer is clear enough. If you have other questions, please feel free to ask.

skygl commented 2 years ago

@Xiao9905

Thanks for the quick answer! It helped me a lot to understand 👍