Seth-Park / MultimodalExplanations

Code release for Park et al., "Multimodal Explanations: Justifying Decisions and Pointing to the Evidence." In CVPR, 2018.
BSD 2-Clause "Simplified" License

Explanation Evaluation Metric #9

Closed jeffz0 closed 5 years ago

jeffz0 commented 5 years ago

Hey Seth,

I've revisited your code and have been running some experiments with your dataset in PyTorch. To evaluate the explanations, I've been using an adapted version of https://github.com/tylin/coco-caption. I've also used your pretrained models to generate explanations for comparison with my results, but I notice I get numbers that differ from those reported in your paper.

Using your pretrained model with --use_gt, I get 16.7 on BLEU-4 (19.8 in the paper), 51.3 on CIDEr (73.4 in the paper), and 39.6 on ROUGE (44.0 in the paper). What evaluation code did you use to compute your metrics, and what do you think could be the reason for this difference?

Thanks!

Jeff

Seth-Park commented 5 years ago

Hi Jeff,

Have you tried evaluating the sentences uploaded here: https://drive.google.com/drive/u/1/folders/17xO2nXgN4oLwCpwyFtMLdMIH9laQOpri

These are the generated sentences used for the paper, and I was able to reproduce the reported numbers with my caption evaluation code, which I believe is based on the same repo. If you don't get the same results on them, the difference likely comes from the eval code; if you do get the same numbers, then the problem is with my pretrained model and I'll take a deeper look. Just as a sanity check, could you run your eval code on the uploaded sentences?

Thanks, Seth

jeffz0 commented 5 years ago

Hey Seth,

I appreciate the quick reply! I realized I was incorrectly adapting the coco-caption code by not using PTBTokenizer(). After tokenizing the sentences, I achieve the same scores you report in the paper using your released model.
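In case anyone else runs into this, here is a minimal sketch of the tokenization step (import paths assume the standard tylin/coco-caption `pycocoevalcap` layout; the example sentences, ids, and variable names are purely illustrative):

```python
# Minimal example of scoring generated explanations with the coco-caption
# metrics, including the PTBTokenizer step that was missing above.
# Import paths assume the standard tylin/coco-caption (pycocoevalcap) layout;
# PTBTokenizer shells out to the Stanford tokenizer, so Java must be installed.
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Ground-truth and generated explanations keyed by example id, in the
# {"caption": ...} format pycocoevalcap expects. Sentences are made up here.
gts = {"0": [{"caption": "because the batter is swinging at the ball"}]}
res = {"0": [{"caption": "because the batter swings the bat at the ball"}]}

# Tokenize both sides before scoring; skipping this step deflates the scores.
tokenizer = PTBTokenizer()
gts_tok = tokenizer.tokenize(gts)  # {"0": ["because the batter is swinging ..."]}
res_tok = tokenizer.tokenize(res)

scorers = [
    (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
    (Cider(), "CIDEr"),
    (Rouge(), "ROUGE_L"),
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts_tok, res_tok)
    print(name, score)
```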

Thanks again,

Jeff