    # Create optimizer groups. Any parameter with two or more dimensions is weight decayed; all others are not.
    # I.e. all weight tensors in matmuls and embeddings decay; biases and LayerNorm parameters don't.
    decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
    nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
    optim_groups = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': nodecay_params, 'weight_decay': 0.0}
    ]
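The snippet above does not define `param_dict` or `weight_decay`, and it does not show how `optim_groups` is consumed. A minimal runnable sketch follows, assuming `param_dict` maps names to trainable parameters and that the groups are passed to `torch.optim.AdamW`. The toy model, the decay value, and the learning rate are hypothetical, chosen only for illustration.

```python
import torch
from torch import nn

# Hypothetical toy model, used only to get a mix of 1D and 2D parameters.
model = nn.Sequential(
    nn.Embedding(100, 16),  # weight: 2D
    nn.Linear(16, 16),      # weight: 2D, bias: 1D
    nn.LayerNorm(16),       # weight and bias: both 1D
)

weight_decay = 0.1  # assumed value; the original snippet does not specify one

# Assumption: param_dict maps parameter names to the trainable parameters.
param_dict = {n: p for n, p in model.named_parameters() if p.requires_grad}

decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
optim_groups = [
    {'params': decay_params, 'weight_decay': weight_decay},
    {'params': nodecay_params, 'weight_decay': 0.0},
]

# Assumption: AdamW as the optimizer. torch.optim optimizers accept a list of
# dicts like this, with each dict overriding options for its own group.
optimizer = torch.optim.AdamW(optim_groups, lr=1e-3)

print(len(decay_params), len(nodecay_params))
```

Here the embedding and linear weights land in the decayed group, while the linear bias and both LayerNorm parameters land in the non-decayed group.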

DLLXW commented 8 months ago

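In the optimizer configuration, why are parameters with two or more dimensions weight decayed, while parameters with fewer than two dimensions are not?

Answer from ChatGPT:

In machine learning, optimization algorithms often use weight decay to control model complexity and prevent overfitting. Weight decay is implemented by adding a regularization term to the loss function that penalizes large weight values.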
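As a sketch of the regularization term described above, assuming the common L2 form: the symbols $\mathcal{L}$ (data loss), $w$ (weights), and $\lambda$ (decay coefficient) are notation introduced here, not taken from the thread.

```latex
\mathcal{L}_{\text{total}}(w) = \mathcal{L}(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2,
\qquad
\nabla_w \mathcal{L}_{\text{total}}(w) = \nabla_w \mathcal{L}(w) + \lambda w
```

The extra $\lambda w$ term pulls each weight toward zero on every gradient step. Decoupled variants such as AdamW apply this shrinkage to the weights directly instead of going through the loss.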

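In the given snippet, the parameters in the parameter dictionary (param_dict) are split into two groups: parameters with two or more dimensions (decay_params) and parameters with fewer than two dimensions (nodecay_params).

A neural network parameter is a tensor, and its number of dimensions is a convenient proxy for the kind of parameter it is. Weights are typically matrices (2D) or higher-dimensional tensors, whereas biases are just vectors (1D).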
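To make the dimension rule concrete, the following sketch prints `p.dim()` for the parameters of a few standard PyTorch layers. The layers and their sizes are hypothetical, picked only to show typical shapes.

```python
from torch import nn

# Hypothetical layers, chosen only to illustrate typical parameter shapes.
layers = {
    "linear": nn.Linear(8, 4),
    "layernorm": nn.LayerNorm(4),
    "embedding": nn.Embedding(10, 4),
    "conv": nn.Conv2d(3, 6, kernel_size=3),
}

dims = {}
for layer_name, layer in layers.items():
    for param_name, p in layer.named_parameters():
        key = f"{layer_name}.{param_name}"
        dims[key] = p.dim()
        print(f"{key}: shape={tuple(p.shape)}, dim={p.dim()}")
```

Under the `p.dim() >= 2` test, the linear, embedding, and convolution weights would be decayed, while every bias and the LayerNorm weight would not.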

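Parameters with two or more dimensions, such as weight matrices, have weight decay applied to limit their magnitude and reduce the risk of overfitting. Weight matrices hold most of the model's degrees of freedom; left unconstrained, they can become overly complex and cause the model to overfit the training data.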
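The shrinking effect can be seen in a minimal sketch with a single scalar weight, assuming plain SGD with an L2 penalty folded into the gradient. The function name `sgd_step`, the learning rate, and the decay coefficient are illustrative assumptions, not values from the thread.

```python
def sgd_step(w, grad, lr, weight_decay):
    """One plain-SGD update with an L2 penalty folded into the gradient."""
    return w - lr * (grad + weight_decay * w)


w_decay, w_plain = 1.0, 1.0
for _ in range(100):
    # The data gradient is set to zero to isolate the effect of the decay term.
    w_decay = sgd_step(w_decay, grad=0.0, lr=0.1, weight_decay=0.1)
    w_plain = sgd_step(w_plain, grad=0.0, lr=0.1, weight_decay=0.0)

print(w_decay, w_plain)
```

With no data gradient, each step multiplies the decayed weight by `1 - lr * weight_decay` (0.99 here), so it shrinks steadily toward zero, while the weight without decay stays at 1.0.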

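Parameters with fewer than two dimensions, such as bias vectors, usually do not have weight decay applied. They have far fewer elements and contribute little to model complexity, so in many cases they are excluded from weight decay.

Splitting the parameters into groups with different decay settings gives finer control over model complexity and lets each kind of parameter be optimized appropriately.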

zerozhoujie commented 8 months ago

DLLXW / baby-llama2-chinese

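In the optimizer configuration, why are parameters with two or more dimensions weight decayed, while parameters with fewer than two dimensions are not? #55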
