facebookresearch / Mask-Predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

Should the length loss be divided by nsentences? #9

Open PanXiebit opened 4 years ago

PanXiebit commented 4 years ago

https://github.com/facebookresearch/Mask-Predict/blob/11c753526ebbbf5b946d1305a613b830f4f29232/fairseq/criterions/label_smoothed_length_cross_entropy.py#L63

https://github.com/facebookresearch/Mask-Predict/blob/11c753526ebbbf5b946d1305a613b830f4f29232/fairseq/criterions/label_smoothed_length_cross_entropy.py#L73

The total loss is the sum of nll_loss, smooth_loss, and the length loss. When computing the mean loss, nll_loss and smooth_loss should be divided by ntokens, while the length loss should be divided by nsentences. However, in the source code, both are divided by ntokens?
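(Editorial sketch, not code from the repo: a minimal illustration of the two normalization choices being discussed, with hypothetical names; the label-smoothing weights are omitted for brevity.)

```python
def normalize_losses(nll_loss, smooth_loss, length_loss,
                     ntokens, nsentences, length_per_sentence=False):
    # nll_loss and smooth_loss are sums over all target tokens in the batch;
    # length_loss is a sum over sentences (one length prediction per sentence).
    token_loss = (nll_loss + smooth_loss) / ntokens   # per-token average
    if length_per_sentence:
        # normalization this issue suggests: average the length loss per sentence
        return token_loss + length_loss / nsentences
    # normalization the linked code uses: divide everything by ntokens
    return token_loss + length_loss / ntokens
```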

jungokasai commented 4 years ago

That's right. It isn't clear from the paper, but they do divide the length loss by the number of tokens, and I found that to be crucial: if you divide by the number of sentences instead, the length loss dominates the total loss and training fails.

The master fairseq implementation takes a slightly different approach: https://github.com/pytorch/fairseq/blob/master/fairseq/models/nat/cmlm_transformer.py#L147

There, the length loss is divided by the number of tokens and then multiplied by `length_loss_factor = 0.1`. Either way, the length loss has to be discounted.
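(Editorial sketch of the discounting described in this reply, not the actual fairseq code; `length_loss_factor = 0.1` is the default value mentioned above.)

```python
def discounted_length_loss(length_loss_sum, ntokens, length_loss_factor=0.1):
    # Normalize the summed length loss by the token count, then scale it down
    # so it cannot dominate the token-level translation loss.
    return length_loss_factor * (length_loss_sum / ntokens)
```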