dkoguciuk opened 3 years ago
My reproduced results are also much better than those reported in the paper. My conjecture is that the authors fine-tuned the model after the paper was accepted.
Sure, thank you @tuyunbin for letting me know :+1:
Hi, did you train your own model? I have a question: when I evaluate all of the iteration checkpoints on the validation set, I get a summary like the one below. As you can see, the best score for each metric may come from a different iteration's checkpoint. So how do you select the best model for evaluation on the test set? A simple solution is to choose the checkpoint that wins the most metrics. Another is to select the best checkpoint for each metric separately and evaluate each metric on the test set with its own checkpoint, but I am not sure whether that counts as cheating.
```
=========Results Summary==========
------------semantic change best result-------------
Bleu_4: 0.4297590635585063 (dynamic_sents_10000)
CIDEr: 0.9474569644127235 (dynamic_sents_9000)
WMD: 0.22721091315188743 (dynamic_sents_8000)
Bleu_3: 0.5418116472295542 (dynamic_sents_10000)
Bleu_2: 0.6610459720247596 (dynamic_sents_10000)
ROUGE_L: 0.598830625420663 (dynamic_sents_9000)
SPICE: 0.19844836880511552 (dynamic_sents_10000)
METEOR: 0.29471586731759486 (dynamic_sents_9000)
Bleu_1: 0.769692278570386 (dynamic_sents_10000)
------------non-semantic change best result-------------
Bleu_4: 0.609208133087405 (dynamic_sents_3000)
CIDEr: 1.1336741815106786 (dynamic_sents_3000)
WMD: 0.6163324610027923 (dynamic_sents_3000)
Bleu_3: 0.6440120613001302 (dynamic_sents_3000)
Bleu_2: 0.6954593746633804 (dynamic_sents_3000)
ROUGE_L: 0.7380316985348573 (dynamic_sents_10000)
SPICE: 0.2993724250263486 (dynamic_sents_3000)
METEOR: 0.4579204844605651 (dynamic_sents_3000)
Bleu_1: 0.7606271392292414 (dynamic_sents_3000)
------------total best result-------------
Bleu_4: 0.47355699109792154 (dynamic_sents_10000)
CIDEr: 1.1300649391677284 (dynamic_sents_10000)
WMD: 0.41847096547139423 (dynamic_sents_10000)
Bleu_3: 0.5743767433066301 (dynamic_sents_10000)
Bleu_2: 0.6792885081736695 (dynamic_sents_10000)
ROUGE_L: 0.6680752260861709 (dynamic_sents_10000)
SPICE: 0.24299929901519754 (dynamic_sents_10000)
METEOR: 0.3354410976477133 (dynamic_sents_10000)
Bleu_1: 0.7781043546021799 (dynamic_sents_10000)
```
Hi @tuyunbin ,
I picked the checkpoint that wins the most metrics in the Total category, so the first of your two methods. TBH, I would consider the second one cheating: we want to give the reader multiple metrics for a better understanding of the performance of a single model.
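In code, the selection I did amounts to something like this minimal sketch (the dict is just the Total section of your summary; `pick_checkpoint` is my own helper name, not something from the repo):

```python
from collections import Counter

# Best checkpoint per metric, copied from the "total best result" section above.
best_ckpt_per_metric = {
    "Bleu_1": "dynamic_sents_10000",
    "Bleu_2": "dynamic_sents_10000",
    "Bleu_3": "dynamic_sents_10000",
    "Bleu_4": "dynamic_sents_10000",
    "CIDEr": "dynamic_sents_10000",
    "METEOR": "dynamic_sents_10000",
    "ROUGE_L": "dynamic_sents_10000",
    "SPICE": "dynamic_sents_10000",
    "WMD": "dynamic_sents_10000",
}

def pick_checkpoint(best_per_metric):
    """Return the checkpoint that wins the largest number of metrics."""
    counts = Counter(best_per_metric.values())
    return counts.most_common(1)[0][0]

print(pick_checkpoint(best_ckpt_per_metric))  # -> dynamic_sents_10000
```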
I don't remember exactly, but I've done both: training and evaluating provided checkpoint and the results were very similar. Training also takes just a couple of hours.
Hi, @dkoguciuk ,
Did you fix the random seeds for numpy and pytorch? The results differ between two training runs, which makes it hard to tune the hyper-parameters properly. I've tried to fix the seeds with the following commands, but it did not work.
```python
import random

import numpy as np
import torch

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
```
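In case it matters, the usual PyTorch reproducibility checklist also covers CUDA seeding and the cuDNN flags. A sketch of a fuller routine (this is general PyTorch advice, not something from this repo, and the helper name is mine):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Seed Python, NumPy and PyTorch (CPU and all GPUs) and make cuDNN deterministic."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for reproducibility in cuDNN convolution algorithm selection.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```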
Hi @tuyunbin ,
no, I didn't fix it. But even if you fix the numpy and torch seeds, I think there is some randomness in the computation itself: massively parallel execution can change the summation order, and floating-point addition is not associative, so results can differ between runs.
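As a quick illustration of the floating-point point (plain Python, nothing repo-specific):

```python
# Floating-point addition is not associative, so summation order matters.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False
```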
Best, D
Hi @dkoguciuk , did you reproduce the results on the spot-the-diff dataset using @Seth-Park 's code?
Hi @tuyunbin ,
no, I did not work on spot-the-diff, only CLEVR_Change.
Best, D
Hi @Seth-Park ,
I'm struggling to understand the evaluation metrics. In the paper you've got Table 2:
But after downloading and evaluating your pretrained model I got the following numbers:
So, I believe I should multiply those metrics by 100, right? But even after scaling, they are better than the numbers reported in the paper, i.e. in the TOTAL section:
Is there any particular reason why you reported smaller numbers in the paper?
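To be explicit about what I mean by scaling (illustrative placeholder values only, since my actual numbers are in the screenshot above):

```python
# Illustrative only: scale eval-script fractions to the percentage convention
# used in the paper's tables. The values below are placeholders.
metrics = {"Bleu_4": 0.4736, "CIDEr": 1.1301, "METEOR": 0.3354}
scaled = {name: round(100 * value, 2) for name, value in metrics.items()}
print(scaled)  # {'Bleu_4': 47.36, 'CIDEr': 113.01, 'METEOR': 33.54}
```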