Hzfinfdu / Diffusion-BERT

ACL'2023: DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models
Apache License 2.0

How to evaluate BLEU score on LM1B? #6

Open jzhang38 opened 1 year ago

jzhang38 commented 1 year ago

Dear authors,

I understand that you plan to release your code in January, but could you share more details on how you evaluate the BLEU score and PPL on the LM1B dataset? I am also working on diffusion models for text and may cite your paper. Thanks!

Hzfinfdu commented 1 year ago

Hi,

We computed BLEU using all the test data as references and reported the average BLEU score over the generated sentences. We sampled 1K sentences each for evaluating BLEU and S-BLEU. For PPL, the ELBO on the test set is an upper bound on the token-wise NLL; we first convert this bound to a per-word NLL and then exponentiate it to obtain the per-word PPL. Hope this helps!
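
As a rough illustration (my own sketch, not the released code), scoring each generated sentence against the full test set as references and averaging could look like the following; `generated` and `test_references` are hypothetical lists of tokenized sentences:

```python
# Illustrative sketch only, not the authors' released code.
# Each generated sentence is scored against all test sentences as
# references, then the per-sentence scores are averaged, as described above.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def average_bleu(generated, test_references):
    """Mean BLEU of each hypothesis w.r.t. all test sentences (token lists)."""
    smooth = SmoothingFunction().method1  # avoids zero scores on short outputs
    scores = [
        sentence_bleu(test_references, hyp, smoothing_function=smooth)
        for hyp in generated
    ]
    return sum(scores) / len(scores)
```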

yujianll commented 1 year ago

@Hzfinfdu Thanks for the great work! I have a follow-up question. When you say per-word NLL, do you mean calculating $\mathcal{L}_{vlb}$ in Eq. 3 for each token? Do you sum the NLL over all tokens in the sequence and use that as the NLL for the sequence? Also, I noticed that in Fig. 4 the validation ELBO is around 110 after training, yet the test set PPL is around 60~70. I wonder why these two values differ so much.

Hzfinfdu commented 1 year ago

@yujianll Hi,

  1. Yes, we sum the NLL over all tokens in the sequence to get the NLL for the sequence.
  2. The validation ELBO is around 110, and the average number of words per sequence in the test set is around 26, so the per-word NLL is around 110 / 26 ≈ 4.23. The test PPL is obtained as exp(4.23) ≈ 68.7, which matches the 60~70 range.
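
For concreteness, the conversion works out as follows (a back-of-the-envelope check using the figures quoted above):

```python
import math

# Figures quoted above: per-sequence ELBO in nats and the average
# number of words per sequence in the test set.
elbo_per_sequence = 110
avg_words_per_sequence = 26

per_word_nll = elbo_per_sequence / avg_words_per_sequence  # ~4.23 nats/word
ppl = math.exp(per_word_nll)                               # ~68.7
```
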
yujianll commented 1 year ago

@Hzfinfdu Thanks for the reply! I have another low-level question. When you calculate the NLL on the test set, do you sum over all T diffusion steps, or do you sample a few timesteps for the calculation? If you sample, how many timesteps do you use?

Hzfinfdu commented 1 year ago

@yujianll Hi,

We trained DiffusionBERT with 512 diffusion steps and used DDIM sampling to uniformly subsample 128 steps on the test set, both for the NLL calculation and for generation.
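
In case it is useful, here is a minimal sketch of what uniform step subsampling could look like (my own illustration, not the repository's code; `T` and `S` follow the numbers above):

```python
import numpy as np

# Pick 128 evenly spaced timesteps out of 512 and traverse them in
# reverse during sampling, DDIM-style. Illustrative only.
T, S = 512, 128
timesteps = np.linspace(0, T - 1, S).round().astype(int)

for t in reversed(timesteps):
    pass  # one reverse-diffusion / denoising step at timestep t
```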

Hope this helps!

yujianll commented 1 year ago

Thanks, this helps a lot!