Seth-Park / RobustChangeCaptioning

Code and dataset release for Park et al., Robust Change Captioning (ICCV 2019)

How to understand the results #1

Open dkoguciuk opened 3 years ago

dkoguciuk commented 3 years ago

Hi @Seth-Park ,

I'm struggling to understand the evaluation metrics. In the paper you've got Table 2:

[Screenshot of Table 2 from the paper]

But after downloading and evaluating your pretrained model I got the following numbers:

------------semantic change best result-------------
CIDEr: 1.00742455128 (test)
Bleu_4: 0.511085051903 (test)
Bleu_3: 0.612453337061 (test)
Bleu_2: 0.712512983841 (test)
Bleu_1: 0.80904675167 (test)
ROUGE_L: 0.654282229769 (test)
METEOR: 0.334430665011 (test)
SPICE: 0.2793739702 (test)
------------non-semantic change best result-------------
CIDEr: 1.14646062504 (test)
Bleu_4: 0.618167729466 (test)
Bleu_3: 0.64995045894 (test)
Bleu_2: 0.715953303178 (test)
Bleu_1: 0.783191698339 (test)
ROUGE_L: 0.763303090909 (test)
METEOR: 0.50608216891 (test)
SPICE: 0.346267623357 (test)
------------total best result-------------
CIDEr: 1.14955152668 (test)
Bleu_4: 0.535546570013 (test)
Bleu_3: 0.621429742545 (test)
Bleu_2: 0.71323722181 (test)
Bleu_1: 0.801726535202 (test)
ROUGE_L: 0.708792660339 (test)
METEOR: 0.37936030774 (test)
SPICE: 0.312820796779 (test)

So I believe I should multiply those metrics by 100, right? But then they are better than the numbers reported in the paper, e.g. in the TOTAL section:

Bleu_4 pretrained: 53.6 > Bleu_4 reported 47.3
CIDEr pretrained: 115.0 > CIDEr reported 112.3
METEOR pretrained 37.9 > METEOR reported 33.9 
SPICE pretrained 31.3 > SPICE reported 24.5

Is there any particular reason why you reported smaller numbers in the paper?

tuyunbin commented 3 years ago

My reproduced results are also much better than those reported in the paper. My conjecture is that the authors fine-tuned the model after the paper was accepted.

dkoguciuk commented 3 years ago

Sure, thank you @tuyunbin for letting me know :+1:

tuyunbin commented 3 years ago

Hi, did you train your own model? I have a question: when I evaluate all of the iteration checkpoints on the validation set, I get a summary like the one below. From it, we can see that the best score for each metric may come from a different iteration checkpoint. So, how do you select the best model for evaluation on the test set? A simple solution is to choose the checkpoint that wins the most metrics. Another is to select, for each metric, the checkpoint that is best on that metric and evaluate only that metric on the test set, but I am not sure whether this is cheating.

=========Results Summary==========
------------semantic change best result-------------
Bleu_4: 0.4297590635585063 (dynamic_sents_10000)
CIDEr: 0.9474569644127235 (dynamic_sents_9000)
WMD: 0.22721091315188743 (dynamic_sents_8000)
Bleu_3: 0.5418116472295542 (dynamic_sents_10000)
Bleu_2: 0.6610459720247596 (dynamic_sents_10000)
ROUGE_L: 0.598830625420663 (dynamic_sents_9000)
SPICE: 0.19844836880511552 (dynamic_sents_10000)
METEOR: 0.29471586731759486 (dynamic_sents_9000)
Bleu_1: 0.769692278570386 (dynamic_sents_10000)
------------non-semantic change best result-------------
Bleu_4: 0.609208133087405 (dynamic_sents_3000)
CIDEr: 1.1336741815106786 (dynamic_sents_3000)
WMD: 0.6163324610027923 (dynamic_sents_3000)
Bleu_3: 0.6440120613001302 (dynamic_sents_3000)
Bleu_2: 0.6954593746633804 (dynamic_sents_3000)
ROUGE_L: 0.7380316985348573 (dynamic_sents_10000)
SPICE: 0.2993724250263486 (dynamic_sents_3000)
METEOR: 0.4579204844605651 (dynamic_sents_3000)
Bleu_1: 0.7606271392292414 (dynamic_sents_3000)
------------total best result-------------
Bleu_4: 0.47355699109792154 (dynamic_sents_10000)
CIDEr: 1.1300649391677284 (dynamic_sents_10000)
WMD: 0.41847096547139423 (dynamic_sents_10000)
Bleu_3: 0.5743767433066301 (dynamic_sents_10000)
Bleu_2: 0.6792885081736695 (dynamic_sents_10000)
ROUGE_L: 0.6680752260861709 (dynamic_sents_10000)
SPICE: 0.24299929901519754 (dynamic_sents_10000)
METEOR: 0.3354410976477133 (dynamic_sents_10000)
Bleu_1: 0.7781043546021799 (dynamic_sents_10000)

dkoguciuk commented 3 years ago

Hi @tuyunbin ,

I picked the checkpoint that wins the most metrics in the Total category, so the first of your two methods. I would consider the second one cheating, TBH: we want to give the reader multiple metrics so they can better understand the performance of a single model.
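For reference, a minimal sketch of that selection rule (not from this repo; the metric-to-checkpoint mapping below is a made-up example in the shape of your summary):

```python
from collections import Counter

# Hypothetical example: for each metric, the checkpoint that scored best
# on the validation set (mirrors the "total best result" summary format).
best_per_metric = {
    "Bleu_4": "dynamic_sents_10000",
    "CIDEr": "dynamic_sents_9000",
    "METEOR": "dynamic_sents_10000",
    "ROUGE_L": "dynamic_sents_10000",
    "SPICE": "dynamic_sents_9000",
}

# Count how many metrics each checkpoint wins and keep the overall winner;
# that single checkpoint is then evaluated on the test set for all metrics.
wins = Counter(best_per_metric.values())
best_checkpoint, n_wins = wins.most_common(1)[0]
print(best_checkpoint, n_wins)  # dynamic_sents_10000 3
```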

I don't remember exactly, but I've done both: trained my own model and evaluated the provided checkpoint, and the results were very similar. Training also takes just a couple of hours.

tuyunbin commented 3 years ago

Hi, @dkoguciuk ,

Did you fix the seeds for numpy and pytorch? The results differ between two training runs, which makes it hard to tune the hyper-parameters. I've tried fixing them with the following calls, but it did not work.

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
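For context, a fuller seeding sketch would also cover the GPU and cuDNN (standard PyTorch calls, not taken from this repo; exact behaviour depends on the PyTorch version and GPU kernels in use):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    # Seed Python, NumPy, and PyTorch (CPU and all visible GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable autotuning,
    # which otherwise selects algorithms non-deterministically.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```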

dkoguciuk commented 3 years ago

Hi @tuyunbin ,

no, I didn't fix it. But even if you fix the numpy and torch seeds, I think there is still some randomness in the computation itself: massively parallel execution changes the summation order, and floating-point addition is not associative, so results can differ slightly between runs.
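A small illustration of that point (plain NumPy, unrelated to this repo): summing the same values in two different orders gives slightly different float32 results.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

# Same values, two summation orders: strict left-to-right vs. a chunked
# reduction, which is closer to what a parallel sum does.
left_to_right = np.float32(0.0)
for v in x:
    left_to_right += v
chunked = x.reshape(100, 1000).sum(axis=1).sum()

print(left_to_right, chunked, left_to_right == chunked)
```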

Best, D

tuyunbin commented 3 years ago

Hi @dkoguciuk , did you reproduce the results in the spot-the-diff dataset using @Seth-Park 's code?

dkoguciuk commented 3 years ago

Hi @tuyunbin ,

no, I did not work on spot-the-diff, only CLEVR_Change.

Best, D