Questions about the NLL loss #66

Open AlonzoLeeeooo opened 1 year ago

AlonzoLeeeooo commented 1 year ago

Hi @XiangLi1999 ,

Thanks for the amazing work! I have run into a few questions while implementing Diffusion-LM (see the sketch after the list for how I compute the two loss terms):

  1. During my experiments, I noticed that decoder_nll (essentially a cross-entropy loss) stays at zero for a period of training (about 8k steps), and only afterwards starts taking increasing values. Is this normal when training Diffusion-LM? How should decoder_nll behave if training is implemented correctly?
  2. The second question is about tT_loss. tT_loss stays at a constant value during training (roughly 1.3e-7). This happens when I apply cosine annealing with warmup to the learning rate; with a constant learning rate or a linear decay schedule, tT_loss decreases instead. I am now confused about which curve is correct. Could you explain a little what the tT_loss curve should look like if Diffusion-LM is trained correctly?
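
For context, here is a minimal sketch of how I understand the two terms are computed, loosely following the roles of `training_losses_e2e` and `token_discrete_loss` in this repo. The helper names and the `sqrt_alphas_cumprod_T` argument are my own simplification, so please correct me if this differs from the intended computation:

```python
import torch
import torch.nn.functional as F

def mean_flat(tensor):
    # Average over all non-batch dimensions.
    return tensor.mean(dim=list(range(1, tensor.dim())))

def decoder_nll(logits, input_ids):
    # Rounding term: cross-entropy between the LM head's logits over the
    # vocabulary and the ground-truth token ids.
    #   logits:    (batch, seq_len, vocab)
    #   input_ids: (batch, seq_len)
    # One possible reason for the ~0 phase I observe: if the learned
    # embeddings are still nearly degenerate early in training, the head
    # may fit them trivially until the embedding space spreads out.
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # (batch * seq_len, vocab)
        input_ids.view(-1),                # (batch * seq_len,)
        reduction="none",
    )
    return ce.view(input_ids.shape).mean(dim=-1)  # per-example NLL

def tT_loss(x_start, sqrt_alphas_cumprod_T):
    # Prior term at the final timestep T: penalizes whatever signal from
    # x_0 (the token embeddings) survives in the mean of q(x_T | x_0).
    # With a standard noise schedule, sqrt_alphas_cumprod_T is tiny, so
    # this term is very small and only moves as the embedding norms do,
    # which may be why it looks constant in my runs.
    out_mean = sqrt_alphas_cumprod_T * x_start
    return mean_flat(out_mean ** 2)
```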

Thank you in advance for taking time out of your busy schedule to look at this issue. It would be a big help if you could answer the questions above.

Best,