alasdairtran / transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning
https://transform-and-tell.ml/

Questions regarding reproducing the result in the paper. #20

Closed zmykevin closed 2 years ago

zmykevin commented 3 years ago

Hi Alasdair, I want to reproduce the results reported in your paper with this code base. I tried the 9_transformer_objects checkpoint on the GoodNews dataset on my server with a single RTX 2080Ti GPU, but it ended up with a BLEU score of only 3.15, which is quite low compared to the result reported in the paper. I did not change anything in the code or configuration file except that I used a different GPU (which I think should not matter much during evaluation). Is there anything you would suggest checking or changing (such as a configuration setting) in order to get the results from your paper? Thanks.

alasdairtran commented 3 years ago

I don't think the GPU matters that much. The RTX 2080Ti is even newer than the Titan V that I used to train the models, so it should be able to handle them (although I think I had to train one of the LSTM baselines on a Titan RTX because it needs more than 12GB of memory). And since the provided conda environment pins all package versions, it shouldn't be a version issue either.

Just confirming, are these the two commands you ran:

# Evaluate on test set. 
tell evaluate expt/goodnews/9_transformer_objects/config.yaml -m expt/goodnews/9_transformer_objects/serialization/best.th

# Compute evaluation metrics
python scripts/compute_metrics.py -c data/goodnews/name_counters.pkl expt/goodnews/9_transformer_objects/serialization/generations.jsonl

where best.th is the pretrained checkpoint obtained from me? And the BLEU-4 score is read off the output of the compute_metrics.py script?

alasdairtran commented 3 years ago

I just ran the evaluation script again myself. The BLEU-4 score given by the evaluate command is indeed 3.15. The reported BLEU-4 score comes from the compute_metrics.py script, which gives us 6.05.

Looking at the code again, I think when we run evaluate, the BLEU-4 score is computed for each batch and then averaged across the batches. Since batches can have different sizes, this BLEU-4 score actually changes depending on the batch size.

In contrast, compute_metrics.py computes BLEU-4 across all the test samples directly. This is what the original GoodNews paper did, so I reported the same metric to make the numbers comparable to previous work.
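
In case it helps to see the difference concretely, here is a minimal sketch (not the repo's code; it uses NLTK's corpus_bleu as a stand-in for the project's metric implementation, and the toy sentences are made up) of why averaging per-batch scores is not the same as scoring the whole test set in one pass:

# Minimal sketch: batch-averaged BLEU-4 vs. BLEU-4 over all samples at once.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu4(refs, hyps):
    # refs/hyps: lists of token lists, one reference per hypothesis.
    return corpus_bleu([[r] for r in refs], hyps, smoothing_function=smooth)

# Two batches of unequal size (e.g. the last test batch is smaller).
batch1_refs = [["a", "man", "stands", "outside"], ["the", "team", "wins"]]
batch1_hyps = [["a", "man", "stands", "outside"], ["a", "team", "loses"]]
batch2_refs = [["snow", "falls", "in", "new", "york"]]
batch2_hyps = [["rain", "falls", "in", "boston"]]

# Roughly what `evaluate` does: score each batch, then average the batch scores,
# so the smaller batch gets the same weight as the larger one.
per_batch = [bleu4(batch1_refs, batch1_hyps), bleu4(batch2_refs, batch2_hyps)]
batch_averaged = sum(per_batch) / len(per_batch)

# Roughly what compute_metrics.py does: score every test sample in one pass.
corpus_level = bleu4(batch1_refs + batch2_refs, batch1_hyps + batch2_hyps)

print(f"mean of per-batch BLEU-4: {batch_averaged:.4f}")
print(f"corpus-level BLEU-4:      {corpus_level:.4f}")

The two numbers differ whenever batch sizes are uneven, which is why the corpus-level score from compute_metrics.py is the one to compare against the paper.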

zmykevin commented 3 years ago

Thanks, Tran! I really appreciate the thoughtful answer; it definitely resolves my question. I will look into the code to understand the implementation in more detail.