aimagelab / meshed-memory-transformer

Meshed-Memory Transformer for Image Captioning. CVPR 2020
BSD 3-Clause "New" or "Revised" License

Reproduce results with test.py #35

Closed · XuMengyaAmy closed this issue 3 years ago

XuMengyaAmy commented 3 years ago

Q1: In test.py:

data = torch.load('meshed_memory_transformer.pth')

data = torch.load('saved_models/m2_transformer_best.pth')

model.load_state_dict(data['state_dict'])
print("Epoch %d" % data['epoch'])
print(data['best_cider'])

Error: KeyError: 'epoch', KeyError: 'best_cider'

Was the provided 'meshed_memory_transformer.pth' not saved by train.py? When I load a model saved from my own training run in test.py, there is no error. Where does the provided 'meshed_memory_transformer.pth' come from?

Also, for my own dataset, when I load the saved model in test.py, why does the performance drop compared with the evaluation metrics recorded during training in train.py?

Q2: In test.py:

dict_dataset_val = val_dataset.image_dictionary({'image': image_field, 'text': RawField()})

What is the purpose of image_dictionary, and what is the difference between dict_dataset_val and val_dataset? I printed both out and observed that their captions are different, and that len(dict_dataset_val) differs from len(val_dataset). Why is that?

Thanks for your help!

marcellacornia commented 3 years ago

Hi @XuMengyaAmy, thanks for your interest in our work!

In the released weight file, we only stored the state_dict of the best model. In our experiments, the best CIDEr was obtained at epoch 28.
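As a minimal sketch of how test.py could tolerate both checkpoint formats (assuming the `model` object and file names from the snippet above; the exact keys saved by your train.py may differ), you can guard the optional bookkeeping fields:

```python
import torch

# Works with both the released weights (which store only 'state_dict')
# and a checkpoint saved during training (which may also store
# 'epoch', 'best_cider', and other bookkeeping fields).
data = torch.load('meshed_memory_transformer.pth', map_location='cpu')
model.load_state_dict(data['state_dict'])

# Print the extra fields only when the checkpoint actually contains them.
if 'epoch' in data:
    print("Epoch %d" % data['epoch'])
if 'best_cider' in data:
    print("Best CIDEr: %s" % data['best_cider'])
```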

Regarding the evaluation, we created the image_dictionary to group together all ground-truth captions of the same image. In this way, the captioning evaluation metrics are computed by comparing the caption generated for a given image with all five ground-truth captions of that image. For this reason, the length of val_dataset is five times that of dict_dataset_val.
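To illustrate the idea (a hedged sketch only, not the actual image_dictionary implementation; the image ids and captions below are invented), grouping turns five per-caption samples into one per-image entry with five references, which is the layout the captioning metrics expect:

```python
from collections import defaultdict

# Flat validation view: one sample per ground-truth caption,
# so each image appears five times.
flat_samples = [
    ('img_1', 'a man riding a horse'),
    ('img_1', 'a person on a brown horse'),
    ('img_1', 'a rider and his horse in a field'),
    ('img_1', 'a man sits on a horse outdoors'),
    ('img_1', 'someone riding a horse on grass'),
    # ... five captions for img_2, img_3, and so on
]

# Dictionary view: one entry per image, with all its references grouped.
grouped = defaultdict(list)
for image_id, caption in flat_samples:
    grouped[image_id].append(caption)

# Metrics such as CIDEr then compare the single generated caption for an
# image against all five of its references, e.g.
#   gts = {'img_1': grouped['img_1']}, res = {'img_1': [generated_caption]}
print(len(flat_samples), len(grouped))  # the flat view is ~5x longer
```

This also explains why the printed captions differ: val_dataset yields one caption per sample, while dict_dataset_val yields all references for each image.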

I hope this helps!