Training Cost due to the EMA mechanism

Thanks for the nice work and for releasing the code! I just tried it and noticed that the Diffusion-LM training code uses an EMA (exponential moving average) mechanism, which appears to limit how much the model parameters change at each update. This may well stabilize training, but it also seems to increase the training cost. If EMA were removed, would the performance of Diffusion-LM degrade significantly? Or could the training cost be reduced substantially?
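For reference, here is a minimal sketch of what I understand an EMA update to be. It is not taken from the repository: the function name `ema_update`, the `rate` value, and the toy weights are my own assumptions for illustration. The averaged copy moves only a fraction `(1 - rate)` of the way toward the current parameters at each step.

```python
def ema_update(ema_params, params, rate=0.999):
    """Return new EMA values: rate * ema + (1 - rate) * param, elementwise.

    Hypothetical helper for illustration; not the Diffusion-LM implementation.
    """
    return [rate * e + (1.0 - rate) * p for e, p in zip(ema_params, params)]


# Toy example: plain floats stand in for model weights.
params = [1.0, -2.0]      # current (freshly optimized) weights
ema_params = [0.0, 0.0]   # running average kept alongside them

# One update with an assumed rate of 0.9: the average moves 10% toward params.
ema_params = ema_update(ema_params, params, rate=0.9)
```

With a rate close to 1, the averaged weights change very slowly, which is the behavior I am asking about.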

I would very much like to know the motivation for using EMA in your approach. Looking forward to your response.

XiangLi1999/Diffusion-LM, issue #50