Reproducing Table 5: Sentence Infilling - CIDEr / BLEU-4 metrics

XiangLi1999 / Diffusion-LM

Reproducing Table 5: Sentence Infilling - CIDEr / BLEU-4 metrics #59

Open yair-schiff opened 1 year ago

yair-schiff commented 1 year ago

Hi @XiangLi1999,

Thank you for open sourcing this work!

I am trying to reproduce the results from Table 5, the infilling experiment. Specifically, where do the CIDEr and BLEU-4 scores come from, and how are they computed? I don't see those metrics reported on the aNLG leaderboard.

Any guidance you can provide here will be much appreciated.

Thanks!

XiangLi1999 commented 1 year ago

Hi Yair,

Thanks for reaching out!

We compute these two scores because they are also reported in https://arxiv.org/pdf/2202.11705.pdf (which is our primary baseline for comparison).

We compute them with the evaluation scripts released along with the E2E benchmark: https://github.com/tuetschek/e2e-metrics

Best, Lisa
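
For anyone trying to follow the pointer above, here is a minimal sketch of preparing inputs for the e2e-metrics scripts. It is not the authors' code. It assumes the file layout I understand that repository to accept: a reference file with one reference per line and a blank line between instances, plus a system-output file with one generation per line in the same order. The function names and the `refs.txt` / `hyps.txt` file names are hypothetical; please check the e2e-metrics README for the exact format and invocation.

```python
from pathlib import Path


def format_references(refs_per_instance):
    """One reference per line, with a blank line between instances (assumed layout)."""
    blocks = ["\n".join(r.strip() for r in refs) for refs in refs_per_instance]
    return "\n\n".join(blocks) + "\n"


def format_hypotheses(hyps):
    """One system output per line, in the same order as the reference instances."""
    return "\n".join(h.strip() for h in hyps) + "\n"


def write_eval_files(refs_per_instance, hyps, out_dir="."):
    """Write refs.txt and hyps.txt (hypothetical names) and return their paths."""
    if len(refs_per_instance) != len(hyps):
        raise ValueError("need exactly one hypothesis per reference instance")
    out = Path(out_dir)
    ref_path = out / "refs.txt"
    hyp_path = out / "hyps.txt"
    ref_path.write_text(format_references(refs_per_instance), encoding="utf-8")
    hyp_path.write_text(format_hypotheses(hyps), encoding="utf-8")
    return ref_path, hyp_path
```

With the two files written, the scoring script from the e2e-metrics checkout (`measure_scores.py`, as far as I can tell) should be runnable on them, references first and system outputs second. Whether this matches the tokenization and reference sets behind Table 5 is not confirmed in this thread.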