Y-H-Joe opened this issue 2 years ago
You make a good point, but based on my own experiments (run rather roughly: both the training set and the test set were the first 10,000 bilingual sentence pairs from the fra-eng.txt dataset; with or without the modification, the mean BLEU over these 10,000 samples was essentially unchanged), the situation you describe does not occur, namely: "the model predicts sentences as long as possible, so that more positions become 0 after masking, which drags down the result of `weighted_loss = (unweighted_loss * weights).mean(dim=1)`".
My guess at the reason is this: during training, the input to the Decoder is the ground-truth target sequence (teacher forcing), so the valid lengths used for masking are fixed by the labels rather than by how long the model chooses to predict.
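For reference, a minimal sketch of what teacher forcing looks like in a d2l-style training step (the function name and the `bos_idx` argument are illustrative, not the book's exact source):

```python
import torch

def decoder_input(Y, bos_idx):
    """Build the decoder input from the ground-truth labels Y (batch, steps).

    The decoder always consumes the ground truth shifted right and prefixed
    with <bos>, never its own predictions, so at training time the model
    cannot change how many positions the mask treats as valid.
    """
    bos = torch.full((Y.shape[0], 1), bos_idx, dtype=Y.dtype, device=Y.device)
    return torch.cat([bos, Y[:, :-1]], dim=1)
```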
Your other point about using `mean` does make some sense: when the loss is displayed downstream, `metric[0] / metric[1]` is already the sum of the losses over all sentences in a batch divided by the sum of their valid lengths, so it would be unreasonable for the loss computation itself to also use `mean` and divide by the sentence length.
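To make the arithmetic concrete, here is a small self-contained toy example (made-up per-token losses) comparing the two variants together with the downstream `metric[0] / metric[1]` display:

```python
import torch

# Two sequences padded to length 4; valid lengths are 3 and 2.
unweighted_loss = torch.tensor([[1., 2., 3., 9.],
                                [4., 5., 9., 9.]])   # made-up per-token CE
weights = torch.tensor([[1., 1., 1., 0.],
                        [1., 1., 0., 0.]])           # 1 = valid, 0 = padding
num_tokens = weights.sum()                           # 5 valid tokens

mean_loss = (unweighted_loss * weights).mean(dim=1)  # divides by padded length 4
sum_loss = (unweighted_loss * weights).sum(dim=1)    # sums valid positions only

# Downstream display: batch loss sum divided by the number of valid tokens.
print(mean_loss.sum() / num_tokens)  # tensor(0.7500) -- divided by a length twice
print(sum_loss.sum() / num_tokens)   # tensor(3.)     -- plain per-token average
```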
@Y-H-Joe Please check our recent implementation. We excluded the loss on padding tokens, so the loss has nothing to do with the padding locations. Taking the average of the losses over all valid tokens is quite common.
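For context, a minimal sketch (not the book's verbatim code) of the "average over valid tokens" pattern referred to here, where padding positions are dropped from both the numerator and the denominator; `pad_idx` stands for whatever index the target vocabulary assigns to `'<pad>'`:

```python
import torch
from torch.nn import functional as F

def masked_ce(Y_hat, Y, pad_idx):
    """Cross-entropy averaged over non-padding tokens only.

    Y_hat: (batch, steps, vocab_size) logits; Y: (batch, steps) token ids.
    """
    l = F.cross_entropy(Y_hat.reshape(-1, Y_hat.shape[-1]), Y.reshape(-1),
                        reduction='none')
    mask = (Y.reshape(-1) != pad_idx).float()
    return (l * mask).sum() / mask.sum()
```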
This line: `weighted_loss = (unweighted_loss * weights).mean(dim=1)` should be corrected to: `weighted_loss = (unweighted_loss * weights).sum(dim=1)`.

Reason: when `mean` is used, the padding locations are counted in the denominator and drag the loss down. The model can then learn to cheat by predicting sequences that are as long as possible (as a result, no `eos` is generated). I've tested this. Also, using `mean` is inconsistent with the downstream `loss` calculation: downstream, the `loss` is divided by `num_tokens`, so in total the cross-entropy loss ends up divided by `num_tokens` twice (effectively by its square), which makes no sense. When `sum` is used, the padding locations contribute nothing (they are all zeros), and the total `loss` is divided by `num_tokens` once rather than by `num_tokens` squared.

Thanks.
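For clarity, here is a sketch of the loss class with the proposed change applied (modelled on the book's `MaskedSoftmaxCELoss` and `sequence_mask`, reproduced from memory, so treat it as an approximation rather than the exact source):

```python
import torch
from torch import nn

def sequence_mask(X, valid_len, value=0):
    """Zero out entries of X beyond each sequence's valid length."""
    maxlen = X.size(1)
    mask = torch.arange(maxlen, dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """Softmax cross-entropy loss that masks out padding positions."""
    # pred: (batch, steps, vocab_size); label: (batch, steps); valid_len: (batch,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction = 'none'
        unweighted_loss = super().forward(pred.permute(0, 2, 1), label)
        # Proposed: sum over valid positions instead of averaging over the
        # padded length, so padding cannot shrink the per-sequence loss.
        weighted_loss = (unweighted_loss * weights).sum(dim=1)
        return weighted_loss
```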