jzhang38 opened this issue 1 year ago (status: Open)
Hi,
We computed BLEU using the entire test set as references and report the average BLEU score over the generated sentences. We sampled 1K sentences each for evaluating BLEU and S-BLEU. For PPL, the ELBO on the test set gives an upper bound on the token-wise NLL. We first convert this bound to a per-word NLL and then use that to compute the per-word PPL. Hope this helps!
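For anyone trying to reproduce the PPL conversion described above, here is a minimal sketch, not the authors' code. It assumes the per-token NLL bounds are in nats and are summed over the whole test set before being divided by the number of whitespace-separated words. The function name `per_word_ppl` and its arguments are hypothetical.

```python
import math


def per_word_ppl(token_nll_bounds, num_words):
    """Convert token-level NLL upper bounds (nats) into a per-word perplexity.

    token_nll_bounds: one NLL upper bound per (sub)word token in the test set.
    num_words: number of words in the same text (assumed whitespace-split).
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    # The total bound over the corpus is the sum of the per-token bounds.
    total_nll = sum(token_nll_bounds)
    # Renormalize by words rather than tokens to get a per-word NLL.
    per_word_nll = total_nll / num_words
    # Perplexity is the exponential of the average NLL (in nats).
    return math.exp(per_word_nll)


# Toy numbers: 4 tokens, each with a bound of ln(2) nats, forming 2 words.
print(per_word_ppl([math.log(2.0)] * 4, num_words=2))
```

Because the input is an upper bound on the NLL, the resulting number is an upper bound on the per-word PPL rather than an exact value.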
@Hzfinfdu Thanks for the great work! I have a follow-up question. When you say per-word NLL, do you mean calculating $\mathcal{L}_{vlb}$ in Eq. 3 for each token? Do you sum the NLL over all tokens in the sequence and use that as the NLL for the sequence? Also, I noticed that in Fig. 4 the validation ELBO is around 110 after training, whereas the test set PPL is around 60~70. I wonder why these two values differ so much.
@yujianll Hi,
@Hzfinfdu Thanks for the reply! I have another low-level question. When you calculate NLL on the test set, do you sum over all T diffusion steps, or do you sample a few time steps for the calculation? If you do sample, how many time steps do you use?
@yujianll Hi,
We trained DiffusionBERT with 512 steps and used DDIM sampling to uniformly sample 128 steps on the test set, both for NLL calculation and for generation.
Hope this helps!
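As a rough illustration of the uniform subsampling mentioned in this reply, the sketch below picks 128 evenly spaced time steps out of 512. It is not the authors' implementation: the 1-indexed, descending convention and the choice to keep the last step of each stride are assumptions, and `uniform_timestep_subset` is a hypothetical helper name.

```python
def uniform_timestep_subset(total_steps=512, num_sampled=128):
    """Return evenly spaced diffusion time steps in descending order.

    Assumes 1-indexed steps (1..total_steps) and that total_steps is an
    exact multiple of num_sampled, as with 512 and 128.
    """
    if num_sampled <= 0 or total_steps % num_sampled != 0:
        raise ValueError("total_steps must be a positive multiple of num_sampled")
    stride = total_steps // num_sampled
    # Start at the noisiest step and move toward the data in equal strides.
    return list(range(total_steps, 0, -stride))


steps = uniform_timestep_subset()
print(len(steps), steps[:3], steps[-3:])
```

With 512 training steps and 128 sampled steps the stride is 4, so every fourth step is visited during both NLL estimation and generation under these assumptions.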
Thanks, this helps a lot!
Dear authors,
I understand that you plan to release your code in January. But could you share more details regarding how you evaluate the BLEU score and PPL on the LM1B dataset? I am also working on diffusion models for text and may potentially cite your paper. Thanks!