MAGMA - a GPT-style multimodal model that can understand any combination of images and language. NOTE: The freely available model from this repo is only a demo. For the latest multimodal and multilingual models from Aleph Alpha check out our website https://app.aleph-alpha.com
Hi, thanks for the awesome project.
I noticed that the reported BLEU-4 and CIDEr scores in Table 1 are ~10 and ~50 on the MS COCO dataset (zero-shot; after fine-tuning they increase to 31 and 90+, respectively), which fall far behind traditional baselines like AoA and CLIP-ViL (these usually achieve ~40 BLEU-4 and 120+ CIDEr).
I am wondering whether the difference is due to the evaluation setup: did you use the evaluation code in coco-caption, or calculate the scores yourself?
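For context on what is being compared, here is a minimal pure-Python sketch of sentence-level BLEU-4 (uniform weights, brevity penalty, no smoothing). This is a simplified illustration of the metric, not the coco-caption implementation, which computes corpus-level BLEU by aggregating clipped counts over the whole test set; small implementation differences like this can shift reported scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, references):
    """Sentence-level BLEU-4 sketch.
    candidate: list of tokens; references: list of token lists."""
    precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            return 0.0  # candidate too short to contain any n-grams
        # clip each candidate n-gram count by its max count over the references
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing: any zero precision zeroes the score
        precisions.append(clipped / sum(cand_counts.values()))
    # brevity penalty against the closest reference length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```

A candidate identical to its reference scores 1.0; a candidate with no matching 4-grams scores 0.0 under this unsmoothed variant.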