facebookresearch / Mask-Predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

Should the length loss be divided by nsentences? #9

Open PanXiebit opened 4 years ago

PanXiebit commented 4 years ago

https://github.com/facebookresearch/Mask-Predict/blob/11c753526ebbbf5b946d1305a613b830f4f29232/fairseq/criterions/label_smoothed_length_cross_entropy.py#L63

https://github.com/facebookresearch/Mask-Predict/blob/11c753526ebbbf5b946d1305a613b830f4f29232/fairseq/criterions/label_smoothed_length_cross_entropy.py#L73

The total loss is the sum of nll_loss, smooth_loss, and the length loss. When computing the mean loss, nll_loss and smooth_loss should be divided by ntokens, while the length loss should be divided by nsentences. However, in the source code, both are divided by ntokens?
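(Editorial sketch, not code from the repo: a minimal illustration of the two normalization choices being discussed, with hypothetical names; the label-smoothing weights are omitted for brevity.)

```python
def normalize_losses(nll_loss, smooth_loss, length_loss,
                     ntokens, nsentences, length_per_sentence=False):
    # nll_loss and smooth_loss are sums over all target tokens in the batch;
    # length_loss is a sum over sentences (one length prediction per sentence).
    token_loss = (nll_loss + smooth_loss) / ntokens   # per-token average
    if length_per_sentence:
        # normalization this issue suggests: average the length loss per sentence
        return token_loss + length_loss / nsentences
    # normalization the linked code uses: divide everything by ntokens
    return token_loss + length_loss / ntokens
```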

jungokasai commented 4 years ago

That's right. It isn't clear from the paper, but they do divide the length loss by the number of tokens, and I found that to be crucial: if you divide by the number of sentences instead, the length loss dominates the total loss and training fails.

The master fairseq implementation takes a slightly different approach: https://github.com/pytorch/fairseq/blob/master/fairseq/models/nat/cmlm_transformer.py#L147

There, the length loss is divided by the number of tokens and then multiplied by `length_loss_factor = 0.1`. Either way, the length loss has to be discounted.
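(Editorial sketch of the discounting described in this reply, not the actual fairseq code; `length_loss_factor = 0.1` is the default value mentioned above.)

```python
def discounted_length_loss(length_loss_sum, ntokens, length_loss_factor=0.1):
    # Normalize the summed length loss by the token count, then scale it down
    # so it cannot dominate the token-level translation loss.
    return length_loss_factor * (length_loss_sum / ntokens)
```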