THUDM / P-tuning

A novel method to tune language models. Code and datasets for the paper "GPT Understands, Too".
MIT License

Why does the discreteness of word embeddings lead the optimizer to easily fall into local minima? #47

Open xsc1234 opened 1 year ago

xsc1234 commented 1 year ago

I recently read your paper "GPT Understands, Too" and there is a passage I don't quite understand; I hope you can help explain it: "1) Discreteness: the original word embedding e of M has already become highly discrete after pre-training. If h is initialized with random distribution and then optimized with stochastic gradient descent (SGD), which has been proved to only change the parameters in a small neighborhood (AllenZhu et al., 2019), the optimizer would easily fall into local minima." As I understand it, this passage first states that the pre-trained model's word embeddings are discrete with respect to one another. But the trainable parameters h are themselves randomly initialized and do not come from the word embeddings, so how does the discreteness of the word embeddings affect the optimization of h?

init-neok commented 1 year ago

Let me briefly share my own understanding. I have watched the author's talk at BAAI, and I think the point of this sentence is that the randomly initialized pseudo prompts are not real words in any linguistic sense. To make them better fit the properties of natural language, the authors use this observation to motivate the idea, introduced later in the paper, of initializing these pseudo prompts with a BiLSTM. Moreover, P-tuning is a form of continuous prompting, and using an LSTM to produce the prompts also fits that continuous nature.
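The LSTM-based prompt encoder mentioned above can be sketched roughly as follows. This is a toy, NumPy-only illustration under simplifying assumptions (a single-direction LSTM, made-up sizes, untrained weights, and hypothetical names like `ToyPromptEncoder`); the actual P-tuning prompt encoder in the paper is a bidirectional LSTM followed by an MLP, implemented in PyTorch. The sketch only shows the core idea: the pseudo-prompt embeddings are not fed to the frozen language model directly, but are first passed through a small sequence encoder, so neighboring prompt vectors become mutually dependent rather than independently, randomly placed points.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyPromptEncoder:
    """Toy sketch of a P-tuning-style prompt encoder (all sizes hypothetical)."""

    def __init__(self, num_prompt_tokens, dim):
        self.dim = dim
        # Trainable pseudo-prompt embeddings, randomly initialized
        # (the h discussed in the quoted passage).
        self.prompt_embeds = rng.normal(0, 0.02, (num_prompt_tokens, dim))
        # LSTM weights, stacked for the input/forget/cell/output gates.
        self.W = rng.normal(0, 0.02, (4 * dim, dim))   # input-to-hidden
        self.U = rng.normal(0, 0.02, (4 * dim, dim))   # hidden-to-hidden
        self.b = np.zeros(4 * dim)
        # Small head mapping LSTM states back to embedding space.
        self.mlp_w = rng.normal(0, 0.02, (dim, dim))
        self.mlp_b = np.zeros(dim)

    def __call__(self):
        d = self.dim
        h = np.zeros(d)
        c = np.zeros(d)
        outs = []
        # Run the LSTM across the pseudo-prompt positions so each output
        # vector depends on the preceding ones.
        for x in self.prompt_embeds:
            z = self.W @ x + self.U @ h + self.b
            i, f, g, o = z[:d], z[d:2 * d], z[2 * d:3 * d], z[3 * d:]
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            outs.append(np.tanh(h @ self.mlp_w + self.mlp_b))
        # These vectors would replace the pseudo-prompt positions in the
        # input embedding sequence of the frozen language model.
        return np.stack(outs)

encoder = ToyPromptEncoder(num_prompt_tokens=4, dim=8)
prompts = encoder()
print(prompts.shape)  # (4, 8)
```

In training, gradients from the downstream loss would flow back through the encoder into `prompt_embeds`, so the encoder acts as a smooth reparameterization that couples the prompt vectors instead of optimizing each one independently from a random start.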