THUDM / P-tuning

A novel method to tune language models. Code and datasets for the paper "GPT Understands, Too".
MIT License

Why does the discreteness of word embeddings lead the optimizer to easily fall into local minima? #47

Open xsc1234 opened 1 year ago

xsc1234 commented 1 year ago

I recently read your paper "GPT Understands, Too" and there is a passage I don't quite understand; I hope you can help explain it: "1) Discreteness: the original word embedding e of M has already become highly discrete after pre-training. If h is initialized with random distribution and then optimized with stochastic gradient descent (SGD), which has been proved to only change the parameters in a small neighborhood (AllenZhu et al., 2019), the optimizer would easily fall into local minima." As I understand it, this passage first states that the pre-trained model's word embeddings are discrete with respect to one another. But the trainable parameters h are themselves randomly initialized and do not come from the word embeddings, so how does the discreteness of the word embeddings affect the optimization of h?

init-neok commented 1 year ago

Let me briefly share my own understanding. I have watched the author's talk at BAAI, and I think the point of this sentence is that the randomly initialized pseudo prompts are not real words in any linguistic sense. To make them better fit the properties of natural language, the authors use this observation to motivate the idea, introduced later in the paper, of initializing these pseudo prompts with a BiLSTM. Moreover, P-tuning is a form of continuous prompting, and using an LSTM to produce the prompts also fits that continuous nature.
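The LSTM-based prompt encoder mentioned above can be sketched roughly as follows. This is a toy, NumPy-only illustration under simplifying assumptions (a single-direction LSTM, made-up sizes, untrained weights, and hypothetical names like `ToyPromptEncoder`); the actual P-tuning prompt encoder in the paper is a bidirectional LSTM followed by an MLP, implemented in PyTorch. The sketch only shows the core idea: the pseudo-prompt embeddings are not fed to the frozen language model directly, but are first passed through a small sequence encoder, so neighboring prompt vectors become mutually dependent rather than independently, randomly placed points.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyPromptEncoder:
    """Toy sketch of a P-tuning-style prompt encoder (all sizes hypothetical)."""

    def __init__(self, num_prompt_tokens, dim):
        self.dim = dim
        # Trainable pseudo-prompt embeddings, randomly initialized
        # (the h discussed in the quoted passage).
        self.prompt_embeds = rng.normal(0, 0.02, (num_prompt_tokens, dim))
        # LSTM weights, stacked for the input/forget/cell/output gates.
        self.W = rng.normal(0, 0.02, (4 * dim, dim))   # input-to-hidden
        self.U = rng.normal(0, 0.02, (4 * dim, dim))   # hidden-to-hidden
        self.b = np.zeros(4 * dim)
        # Small head mapping LSTM states back to embedding space.
        self.mlp_w = rng.normal(0, 0.02, (dim, dim))
        self.mlp_b = np.zeros(dim)

    def __call__(self):
        d = self.dim
        h = np.zeros(d)
        c = np.zeros(d)
        outs = []
        # Run the LSTM across the pseudo-prompt positions so each output
        # vector depends on the preceding ones.
        for x in self.prompt_embeds:
            z = self.W @ x + self.U @ h + self.b
            i, f, g, o = z[:d], z[d:2 * d], z[2 * d:3 * d], z[3 * d:]
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            outs.append(np.tanh(h @ self.mlp_w + self.mlp_b))
        # These vectors would replace the pseudo-prompt positions in the
        # input embedding sequence of the frozen language model.
        return np.stack(outs)

encoder = ToyPromptEncoder(num_prompt_tokens=4, dim=8)
prompts = encoder()
print(prompts.shape)  # (4, 8)
```

In training, gradients from the downstream loss would flow back through the encoder into `prompt_embeds`, so the encoder acts as a smooth reparameterization that couples the prompt vectors instead of optimizing each one independently from a random start.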