Y-H-Joe opened this issue 2 years ago
You make a good point, but based on my own experiments (run rather roughly: both the training set and the test set were the first 10,000 bilingual sentence pairs from the fra-eng.txt dataset; with or without the modification, the mean BLEU over these 10,000 samples was essentially unchanged), the situation you describe does not occur, namely: "the model predicts sentences as long as possible, so that more positions become 0 after masking, which drags down the result of `weighted_loss = (unweighted_loss * weights).mean(dim=1)`".
My guess at the reason is this: during training, the input to the Decoder is the ground-truth target sequence (teacher forcing), so the valid lengths used for masking are fixed by the labels rather than by how long the model chooses to predict.
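For reference, a minimal sketch of what teacher forcing looks like in a d2l-style training step (the function name and the `bos_idx` argument are illustrative, not the book's exact source):

```python
import torch

def decoder_input(Y, bos_idx):
    """Build the decoder input from the ground-truth labels Y (batch, steps).

    The decoder always consumes the ground truth shifted right and prefixed
    with <bos>, never its own predictions, so at training time the model
    cannot change how many positions the mask treats as valid.
    """
    bos = torch.full((Y.shape[0], 1), bos_idx, dtype=Y.dtype, device=Y.device)
    return torch.cat([bos, Y[:, :-1]], dim=1)
```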
Your other point about using `mean` does make some sense: when the loss is displayed downstream, `metric[0] / metric[1]` is already the sum of the losses over all sentences in a batch divided by the sum of their valid lengths, so it would be unreasonable for the loss computation itself to also use `mean` and divide by the sentence length.
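To make the arithmetic concrete, here is a small self-contained toy example (made-up per-token losses) comparing the two variants together with the downstream `metric[0] / metric[1]` display:

```python
import torch

# Two sequences padded to length 4; valid lengths are 3 and 2.
unweighted_loss = torch.tensor([[1., 2., 3., 9.],
                                [4., 5., 9., 9.]])   # made-up per-token CE
weights = torch.tensor([[1., 1., 1., 0.],
                        [1., 1., 0., 0.]])           # 1 = valid, 0 = padding
num_tokens = weights.sum()                           # 5 valid tokens

mean_loss = (unweighted_loss * weights).mean(dim=1)  # divides by padded length 4
sum_loss = (unweighted_loss * weights).sum(dim=1)    # sums valid positions only

# Downstream display: batch loss sum divided by the number of valid tokens.
print(mean_loss.sum() / num_tokens)  # tensor(0.7500) -- divided by a length twice
print(sum_loss.sum() / num_tokens)   # tensor(3.)     -- plain per-token average
```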
@Y-H-Joe Please check our recent implementation. We excluded the loss on padding tokens, so the loss has nothing to do with the padding locations. Taking the average of the losses over all valid tokens is quite common.
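For context, a minimal sketch (not the book's verbatim code) of the "average over valid tokens" pattern referred to here, where padding positions are dropped from both the numerator and the denominator; `pad_idx` stands for whatever index the target vocabulary assigns to `'<pad>'`:

```python
import torch
from torch.nn import functional as F

def masked_ce(Y_hat, Y, pad_idx):
    """Cross-entropy averaged over non-padding tokens only.

    Y_hat: (batch, steps, vocab_size) logits; Y: (batch, steps) token ids.
    """
    l = F.cross_entropy(Y_hat.reshape(-1, Y_hat.shape[-1]), Y.reshape(-1),
                        reduction='none')
    mask = (Y.reshape(-1) != pad_idx).float()
    return (l * mask).sum() / mask.sum()
```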
This line: `weighted_loss = (unweighted_loss * weights).mean(dim=1)` should be corrected to: `weighted_loss = (unweighted_loss * weights).sum(dim=1)`.

Reason: when `mean` is used, the padding locations are counted in the denominator and drag the loss down. The model can then learn to cheat by predicting sequences that are as long as possible (as a result, no `eos` is generated). I've tested this. Also, using `mean` is inconsistent with the downstream `loss` calculation: downstream, the `loss` is divided by `num_tokens`, so in total the cross-entropy loss ends up divided by `num_tokens` twice (effectively by its square), which makes no sense. When `sum` is used, the padding locations contribute nothing (they are all zeros), and the total `loss` is divided by `num_tokens` once rather than by `num_tokens` squared.

Thanks.
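For clarity, here is a sketch of the loss class with the proposed change applied (modelled on the book's `MaskedSoftmaxCELoss` and `sequence_mask`, reproduced from memory, so treat it as an approximation rather than the exact source):

```python
import torch
from torch import nn

def sequence_mask(X, valid_len, value=0):
    """Zero out entries of X beyond each sequence's valid length."""
    maxlen = X.size(1)
    mask = torch.arange(maxlen, dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """Softmax cross-entropy loss that masks out padding positions."""
    # pred: (batch, steps, vocab_size); label: (batch, steps); valid_len: (batch,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction = 'none'
        unweighted_loss = super().forward(pred.permute(0, 2, 1), label)
        # Proposed: sum over valid positions instead of averaging over the
        # padded length, so padding cannot shrink the per-sequence loss.
        weighted_loss = (unweighted_loss * weights).sum(dim=1)
        return weighted_loss
```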