adymaharana / StoryViz

MIT License

Evaluation metrics #4

Open KyonP opened 2 years ago

KyonP commented 2 years ago

Hello, I hope your research goes well. 😀

I am trying to evaluate my model with the metrics you proposed.

I have read your paper, but I would like to ask you to double-check a few points, since my results seem a bit odd and off the scale. 😢

  1. I presume that the "character F1" score is the "micro avg" F1 reported by your eval_classifier.py code? Am I correct?
  2. Also, does "Frame accuracy" correspond to the "eval Image Exact Match Acc" output of the same eval_classifier.py code?
  3. Are the BLEU-2 and BLEU-3 scores scaled by 100? I tested your translate.py code on my generated images and got scores around 0.04, so I want to confirm that the reported numbers are multiplied by 100 (see the sketch after this list for how I am computing these).
  4. Lastly, the R-precision evaluation method is unclear to me. Do I need to train your H-DAMSM code myself? If so, when is the right time to stop training and benchmark my model?
  5. For a fair comparison, would it be possible to share your pretrained H-DAMSM weights?
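
To make sure we are talking about the same quantities, here is a minimal sketch of how I am currently interpreting points 1–3, on made-up toy data. The arrays and sentences below are placeholders, and this is not your eval_classifier.py or translate.py code; I only want to confirm that the aggregation and scaling match yours.

```python
import numpy as np
from sklearn.metrics import f1_score
from nltk.translate.bleu_score import corpus_bleu

# Hypothetical multi-label character predictions, shape (num_frames, num_characters).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])

# 1. "Character F1": micro-averaged F1 over all character labels,
#    i.e. the "micro avg" row of sklearn's classification_report.
char_f1 = f1_score(y_true, y_pred, average="micro")

# 2. "Frame accuracy": exact match -- a frame counts only if every
#    character label for that frame is predicted correctly.
frame_acc = (y_true == y_pred).all(axis=1).mean()

# 3. BLEU: nltk returns values in [0, 1]; I am assuming the paper
#    reports these multiplied by 100.
references = [[["the", "little", "cat", "sits", "on", "the", "mat"]]]
hypotheses = [["the", "little", "cat", "sat", "on", "the", "mat"]]
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5)) * 100
bleu3 = corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3)) * 100

print(f"character F1 (micro): {char_f1:.4f}")
print(f"frame exact-match accuracy: {frame_acc:.4f}")
print(f"BLEU-2 / BLEU-3 (x100): {bleu2:.2f} / {bleu3:.2f}")
```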

I am currently stuck on the R-precision evaluation with H-DAMSM, so I was thinking of using the recent CLIP R-precision instead; I am opening this issue first to avoid any fairness concerns in the comparison. A rough sketch of what I mean is below.
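
For reference, this is roughly what I had in mind for the CLIP-based variant. The model name, sampling scheme, and helper function are my own assumptions (not your H-DAMSM setup): for each generated image I rank its ground-truth caption against 99 randomly drawn mismatched captions by CLIP similarity and count how often the true caption ranks first.

```python
import random

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_r_precision(image_paths, captions, num_distractors=99, seed=0):
    """Fraction of generated images whose ground-truth caption is ranked first
    against `num_distractors` randomly sampled mismatched captions.
    Assumes len(captions) > num_distractors."""
    rng = random.Random(seed)
    hits = 0
    for i, path in enumerate(image_paths):
        # Candidate texts: the true caption (index 0) plus random distractors.
        distractors = rng.sample(
            [c for j, c in enumerate(captions) if j != i], num_distractors
        )
        texts = [captions[i]] + distractors

        inputs = processor(
            text=texts, images=Image.open(path),
            return_tensors="pt", padding=True, truncation=True,
        ).to(device)
        # logits_per_image has shape (1, num_texts): image-text similarity scores.
        logits = model(**inputs).logits_per_image[0]
        hits += int(logits.argmax().item() == 0)
    return hits / len(image_paths)
```

I picked 99 distractors only because that seems to be the usual R-precision setup in text-to-image papers; please correct me if your protocol uses a different candidate pool.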