HKUNLP / reparam-discrete-diffusion

Reparameterized Discrete Diffusion Models for Text Generation
Apache License 2.0

label smoothing mistake #1

Open

youngsheen commented 1 year ago

When computing the label smoothing loss, the logit_loss is only multiplied by weight_t but is missing the 1/(t+1) factor.

LZhengisme commented 1 year ago

Hi, thanks for your interest!

Technically, both 1/(t+1) and weight_t are associated only with the diffusion ELBO objective, not with the label smoothing loss. It is therefore reasonable to use an arbitrary weighting for the label smoothing loss (which serves as an auxiliary regularization objective) to scale its effect. We ran various ablations in our preliminary experiments and found that multiplying the label smoothing loss by weight_t alone yields the best performance on translation tasks.

However, it could be true that this choice may not be optimal in all cases and that carefully tuning the weighting in a task-specific manner may lead to better performance.
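To make the distinction concrete, here is a minimal sketch of the weighting scheme described above. The function and argument names are illustrative, not the repo's actual code: the ELBO term carries both weight_t and the 1/(t+1) factor, while the label smoothing term is scaled by weight_t only.

```python
def combined_loss(elbo_loss: float, ls_loss: float, weight_t: float, t: int) -> float:
    """Hypothetical sketch of the loss weighting discussed in this thread.

    elbo_loss: per-token diffusion ELBO loss at timestep t
    ls_loss:   label smoothing (auxiliary) loss
    weight_t:  timestep-dependent weight shared by both terms
    t:         current diffusion timestep (0-indexed)
    """
    # The diffusion ELBO term is multiplied by both weight_t and 1/(t+1).
    diffusion_term = weight_t * (1.0 / (t + 1)) * elbo_loss
    # The label smoothing term is an auxiliary regularizer; per the ablations
    # described above, it is scaled by weight_t only (no 1/(t+1) factor).
    label_smoothing_term = weight_t * ls_loss
    return diffusion_term + label_smoothing_term
```

Other scalings of the auxiliary term are equally valid in principle; this particular choice simply reflects what worked best in the ablations mentioned above.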

Hope this clears things up xD