XiangLi1999 / Diffusion-LM

Reproducing Table 5: Sentence Infilling - CIDEr / BLEU-4 metrics #59

Open yair-schiff opened 1 year ago

yair-schiff commented 1 year ago

Hi @XiangLi1999,

Thank you for open-sourcing this work!

I am trying to reproduce the results from Table 5, the infilling experiment. Specifically, where do the CIDEr and BLEU-4 scores come from, and how are they computed? I don't see those metrics reported on the aNLG leaderboard.

Any guidance you can provide here will be much appreciated.

Thanks!

XiangLi1999 commented 1 year ago

Hi Yair,

Thanks for reaching out!

We compute these two scores because they are also reported in https://arxiv.org/pdf/2202.11705.pdf (which is our primary baseline for comparison).

We compute them with the evaluation scripts released along with the E2E benchmark: https://github.com/tuetschek/e2e-metrics

Best, Lisa
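
For readers who want a quick sanity check before running the linked scripts, here is a minimal sketch of what BLEU-4 measures: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. This is not the e2e-metrics code and not necessarily how the Table 5 numbers were produced. The function names `ngrams` and `corpus_bleu4` are hypothetical, tokenization is assumed to be plain whitespace splitting, and no smoothing is applied, so scores from the official scripts may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(hypotheses, references_per_hyp):
    """Unsmoothed corpus-level BLEU-4 (illustrative sketch only).

    hypotheses: list of token lists, one per example.
    references_per_hyp: list of lists of token lists (one or more refs each).
    """
    matched = [0] * 4  # clipped n-gram matches for n = 1..4
    total = [0] * 4    # hypothesis n-gram counts for n = 1..4
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references_per_hyp):
        hyp_len += len(hyp)
        # Use the reference whose length is closest to the hypothesis length
        # (ties go to the shorter reference).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, 5):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / 4
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = "the cat sat on the mat today".split()
    ref = "the cat sat on the mat".split()
    print(corpus_bleu4([hyp], [[ref]]))
```

The numbers reported in a paper should still come from the same scripts the authors used, since tokenization and smoothing choices can shift BLEU noticeably.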