I recently read your paper "GPT Understands, Too" and have a question about the following passage that I hope you can help clarify: "1) Discreteness: the original word embedding e of M has already become highly discrete after pre-training. If h is initialized with random distribution and then optimized with stochastic gradient descent (SGD), which has been proved to only change the parameters in a small neighborhood (AllenZhu et al., 2019), the optimizer would easily fall into local minima." As I understand it, this passage first points out that the pre-trained model's word embeddings are highly discrete, i.e. far apart from one another. However, the trainable parameters h are themselves randomly initialized and do not come from the word embeddings, so how does the discreteness of the word embeddings affect the optimization of h?
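To make my understanding concrete, here is a minimal PyTorch-style sketch of the setup as I read it (the names `word_embedding` and `prompt_h` are my own for illustration, not from your code): the pseudo-prompt embeddings h are freshly initialized parameters, while the word embedding table e is loaded from the pre-trained model, so at initialization the two appear unrelated.

```python
import torch
import torch.nn as nn

# Pre-trained word embedding table e of M (loaded from the checkpoint, kept frozen):
# per the quoted passage, its vectors have become "highly discrete" after pre-training.
word_embedding = nn.Embedding(30522, 768)  # e.g. BERT-base vocab size / hidden size
# word_embedding.load_state_dict(pretrained_embedding_weights)  # illustrative

# Trainable pseudo-prompt embeddings h: randomly initialized,
# NOT taken from the word embedding table.
num_prompt_tokens = 10
prompt_h = nn.Parameter(torch.randn(num_prompt_tokens, 768) * 0.02)

# Only h is updated; per the cited result (AllenZhu et al., 2019),
# SGD only moves these parameters within a small neighborhood
# of their random initialization.
optimizer = torch.optim.SGD([prompt_h], lr=1e-3)
```

Given this setup, I do not see where the discreteness of e enters the optimization of h, which is the source of my confusion.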