
Where is the mask indicator to calculate the decoder loss? #45

Closed smiles724 closed 1 year ago

smiles724 commented 1 year ago

Hi, you use `token_discrete_loss` to compute the decoder loss. However, it seems to sum the cross-entropy loss over every input token, including PAD tokens, which may be unreasonable.

Shouldn't you instead exclude the PAD tokens from the decoder loss?

[screenshot: the `token_discrete_loss` implementation]
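
For context, here is a minimal sketch of the two behaviors being compared. The function name, tensor shapes, and `pad_id` argument are illustrative, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def decoder_ce_loss(logits, input_ids, pad_id=None):
    """Per-token cross-entropy decoder loss.

    logits:    (batch, seq_len, vocab) decoder outputs
    input_ids: (batch, seq_len) target token ids
    pad_id:    if None, every position (PAD included) contributes,
               which is the behavior the question points at; if set,
               PAD positions are masked out of the average.
    """
    ce = F.cross_entropy(
        logits.transpose(1, 2),  # (batch, vocab, seq_len) for per-token CE
        input_ids,
        reduction="none",
    )  # -> (batch, seq_len)
    if pad_id is None:
        return ce.mean()                   # averages over PAD too
    mask = (input_ids != pad_id).float()   # 1 for real tokens, 0 for PAD
    return (ce * mask).sum() / mask.sum()  # average over non-PAD only
```

With `pad_id=None` this matches the all-tokens behavior the question describes; passing the tokenizer's PAD id gives the masked variant being proposed.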

XiangLi1999 commented 1 year ago

Hi,

Thanks for the question!

We do this intentionally, treating the PAD tokens as part of the data distribution. So when Diffusion-LM wants to generate a sentence shorter than the fixed length of 64, it generates PAD tokens for the remaining positions.
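
(For concreteness, a toy illustration of that convention, using a fixed length of 8 instead of 64; the helper name and token strings are made up:)

```python
def pad_to_fixed_length(tokens, length, pad="PAD"):
    # Pad (or truncate) so every training sequence occupies exactly
    # `length` positions; PAD thereby becomes part of the data the
    # model learns to generate, not an ignored filler.
    return (tokens + [pad] * length)[:length]

pad_to_fixed_length(["The", "cat", "sat", ".", "EOS"], length=8)
# -> ['The', 'cat', 'sat', '.', 'EOS', 'PAD', 'PAD', 'PAD']
```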

Alternatively, if we didn't do this during training, the model would learn to generate meaningless tokens after the EOS token at generation time. One way to handle that would be to truncate everything after the EOS token, but we believe the current version is slightly cleaner.
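
(The truncation alternative mentioned above would amount to a post-processing step like this sketch; `eos_id` stands in for whatever id the tokenizer assigns to EOS:)

```python
def truncate_after_eos(token_ids, eos_id):
    # Keep tokens up to and including the first EOS; drop whatever the
    # model emitted after it. This is the cleanup step that masking PAD
    # out of the training loss would make necessary at generation time.
    if eos_id in token_ids:
        return token_ids[: token_ids.index(eos_id) + 1]
    return token_ids

truncate_after_eos([12, 7, 99, 2, 41, 41], eos_id=2)
# -> [12, 7, 99, 2]
```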

Hope this helps!

smiles724 commented 1 year ago

Thanks for your reply. This is interesting and sounds reasonable!