Closed zmykevin closed 2 years ago
I don't think the GPU matters that much. The RTX 2080Ti is even newer than the Titan V that I used to train the models, so it should be able to handle them (although I think I had to train one of the LSTM baselines on a Titan RTX because it needs more than 12GB of memory). With the provided conda environment, all the packages have their versions pinned, so it shouldn't be a version issue either.
Just confirming, are these the two commands you ran:
```shell
# Evaluate on test set.
tell evaluate expt/goodnews/9_transformer_objects/config.yaml -m expt/goodnews/9_transformer_objects/serialization/best.th

# Compute evaluation metrics.
python scripts/compute_metrics.py -c data/goodnews/name_counters.pkl expt/goodnews/9_transformer_objects/serialization/generations.jsonl
```
where `best.th` is the pretrained checkpoint obtained from me? And the BLEU-4 score is read off the output of the `compute_metrics.py` script?
I just ran the evaluation script again myself. The BLEU-4 score given by the `evaluate` command is indeed 3.15. The reported BLEU-4 score comes from the `compute_metrics.py` script, which gives us 6.05.
Looking at the code again, I think when we run `evaluate`, the average BLEU-4 score is computed within each batch, and then we take the average across the batches. Batches can have different sizes, so this BLEU-4 score would actually change depending on the batch size.
In contrast, `compute_metrics.py` takes the BLEU-4 across all the test samples directly. This is what the original GoodNews paper did, so I reported the same metric to make it comparable to previous work.
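To see why the two numbers can differ, here is a minimal toy sketch (not the repo's actual code, and using made-up per-sample scores rather than real BLEU computations): averaging per-batch means weights every batch equally, so a small final batch pulls the result away from the true per-sample average.

```python
# Hypothetical per-sample scores split into unequal batches:
# a full batch of 4 followed by a leftover batch of 1.
batches = [
    [0.10, 0.20, 0.30, 0.40],
    [0.90],
]

# What a batch-averaged metric effectively reports:
# mean of the per-batch means (each batch weighted equally).
batch_means = [sum(b) / len(b) for b in batches]
mean_of_batch_means = sum(batch_means) / len(batch_means)

# What a direct per-sample computation reports:
# mean over all samples, regardless of batching.
samples = [s for batch in batches for s in batch]
global_mean = sum(samples) / len(samples)

print(mean_of_batch_means)  # 0.575
print(global_mean)          # 0.38
```

With a different batch size the split changes, and so does the batch-averaged number, while the per-sample mean stays fixed.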
Thanks Tran! I really appreciate the thoughtful answer; it definitely answers my question. I will look into the code to understand the implementation in more detail.
Hi Alasdair, I want to reproduce the results reported in your paper with this code base. I tried the 9_transformer_objects checkpoint on the GoodNews dataset on my server with a single RTX 2080Ti GPU, but I only got a BLEU score of 3.15, which is quite low compared to the result reported in the paper. I did not change anything in the code or configuration file except that I use a different GPU (which I think should not matter much during evaluation). Is there anything you would suggest checking or changing (such as a configuration setting) in order to get the results from your paper? Thanks