XiangLi1999 / Diffusion-LM

Diffusion-LM
Apache License 2.0

Final e2e training objective definition in code #8

Closed jwkirchenbauer closed 2 years ago

jwkirchenbauer commented 2 years ago

Hi again,

I am working through the training setup and want to make sure I understand which loss function worked out best in your paper, i.e., the one you'd consider best in the end for jointly learning a BERT-backbone word-vector diffusion model.

I believe training_losses_e2e gives the three-term loss, which roughly matches the VLB equation. But what exactly does training_losses_e2e_simple compute? Which one (or both?) did you find optimal for learning the embedding step, the decoding step, and the denoising step simultaneously? And which terms correspond to which components?

Thanks!

XiangLi1999 commented 2 years ago

training_losses_e2e_simple is not the right one: it's not in the paper, it came from a preliminary experiment, and it's an objective that collapses very easily.

Instead, the loss at this line https://github.com/XiangLi1999/Diffusion-LM/blob/344f447deaaa9eccc6c81ba859e05c2166d40fb4/improved-diffusion/improved_diffusion/gaussian_diffusion.py#L1508, under the `self.loss_type == LossType.E2E_MSE` branch of training_losses_e2e, is the one that works best.

For the detailed breakdown, it might be worth looking at the appendix of the paper to understand the derivation.
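For readers following along: the three-term structure being discussed combines a denoising MSE, a term anchoring the learned word embeddings, and a rounding (decoding) cross-entropy back to token ids. Here is a minimal NumPy sketch of that shape; the function and argument names are illustrative only and do not match the repo's actual API, and this omits the per-timestep weighting the real code applies:

```python
import numpy as np

def e2e_three_term_loss(x_start, model_out, x_start_mean, decoder_logits, input_ids):
    """Illustrative three-term end-to-end loss (names are hypothetical).

    x_start:        sampled embedding targets, shape (B, T, D)
    model_out:      model's x0 prediction, shape (B, T, D)
    x_start_mean:   learned embedding EMB(w) of the tokens, shape (B, T, D)
    decoder_logits: logits for rounding x0 back to tokens, shape (B, T, V)
    input_ids:      ground-truth token ids, shape (B, T)
    """
    # Denoising term: MSE between the model's x0 prediction and the target.
    mse = np.mean((model_out - x_start) ** 2)
    # Embedding term: keeps the sampled x_start close to the learned embedding.
    emb = np.mean((x_start - x_start_mean) ** 2)
    # Rounding term: cross-entropy of decoding the embedding back to token ids.
    logp = decoder_logits - np.log(
        np.sum(np.exp(decoder_logits), axis=-1, keepdims=True)
    )
    nll = -np.mean(np.take_along_axis(logp, input_ids[..., None], axis=-1))
    return mse + emb + nll
```

Each term is non-negative, so the total is too; the appendix of the paper gives the actual derivation and weighting of these pieces.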

jwkirchenbauer commented 2 years ago

Thanks so much!