Closed: shivgodhia closed this issue 3 years ago
Here are some results I've gotten from training multiple times:
2021-02-14T05:41:38 | mmf.trainers.callbacks.logistics: val/hateful_memes/cross_entropy: 2.7552, val/total_loss: 2.7552, val/hateful_memes/accuracy: 0.5160, val/hateful_memes/binary_f1: 0.3315, val/hateful_memes/roc_auc: 0.5081
2021-02-14T05:43:42 | mmf.trainers.callbacks.logistics: val/hateful_memes/cross_entropy: 0.7072, val/total_loss: 0.7072, val/hateful_memes/accuracy: 0.5100, val/hateful_memes/binary_f1: 0.4394, val/hateful_memes/roc_auc: 0.5036
2021-02-14T13:56:23 | mmf.trainers.callbacks.logistics: val/hateful_memes/cross_entropy: 0.8421, val/total_loss: 0.8421, val/hateful_memes/accuracy: 0.4960, val/hateful_memes/binary_f1: 0.3668, val/hateful_memes/roc_auc: 0.5317
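The run-to-run spread in the three validation runs above can be summarized with a few lines of plain Python (the numbers are transcribed by hand from the log lines; nothing here calls MMF):

```python
# Mean and spread of the three logged validation runs.
# Values copied from the mmf.trainers.callbacks.logistics log lines above.
from statistics import mean, stdev

runs = [
    {"accuracy": 0.5160, "roc_auc": 0.5081},
    {"accuracy": 0.5100, "roc_auc": 0.5036},
    {"accuracy": 0.4960, "roc_auc": 0.5317},
]

for metric in ("accuracy", "roc_auc"):
    values = [r[metric] for r in runs]
    print(f"{metric}: mean={mean(values):.4f} stdev={stdev(values):.4f}")
```

Across the three runs, mean accuracy is roughly 0.507 and mean AUROC roughly 0.514, so the gap to the reported numbers is not just a single unlucky seed.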
As you can see, I don't get anywhere close to the reported scores of accuracy 52.73 and AUROC 58.79.
Hi @hivestrung,
Can you clarify which phase you are running the model on? Looking at your issue, my guess is that you are using the MMF default, which is phase 2. The final baseline numbers for phase 2 are reported in https://proceedings.neurips.cc//paper/2020/file/1b84c4cee2b8b3d823b30e2d604b1878-Paper.pdf; the arXiv version has yet to be updated. The numbers in the NeurIPS version match what you are observing.
Hi @apsdehal
How do I check which phase I'm running on? If you mean the dataset: I changed the dev set from unseen to seen in the defaults.yaml inside the mmf package under Python's site-packages, and I can verify that changing it affects the score.
Also, using the pretrained model from the zoo I do get around 57 AUROC. So it still seems my dataset is correct, but the pre-trained model is more effective than the standard training that comes with MMF.
Thanks
@apsdehal Apologies, I wasn't sure if you'd managed to take a look at my latest comment regarding the phases? Thanks so much, by the way! Here I've compiled the results from my trained model versus the pretrained model from the MMF model zoo:
model | command | acc | auroc
---|---|---|---
my trained model | mmf_run config=mmf/projects/hateful_memes/configs/unimodal/image.yaml model=unimodal_image dataset=hateful_memes run_type=val checkpoint.resume_file=./save_image-grid/unimodal_image_final.pth checkpoint.resume_pretrained=False dataset_config.hateful_memes.annotations.val[0]=hateful_memes/defaults/annotations/dev_seen.jsonl dataset_config.hateful_memes.annotations.test[0]=hateful_memes/defaults/annotations/test_seen.jsonl | 49.60 | 53.17 |
using the pre-trained model from the zoo | mmf_run config=mmf/projects/hateful_memes/configs/unimodal/image.yaml model=unimodal_image dataset=hateful_memes run_type=val checkpoint.resume_zoo=unimodal_image.hateful_memes.images checkpoint.resume_pretrained=False dataset_config.hateful_memes.annotations.val[0]=hateful_memes/defaults/annotations/dev_seen.jsonl dataset_config.hateful_memes.annotations.test[0]=hateful_memes/defaults/annotations/test_seen.jsonl | 51.40 | 57.21 |
arxiv paper | - | 52.73 | 58.79 |
final baseline for phase 2 in neurips paper | - | 50.67 | 52.33 |
My self-trained model and the model from the zoo give different results on the same dev_seen dataset: the zoo model reports numbers close to the arXiv paper, while my self-trained model reports numbers close to your latest NeurIPS paper. I've just re-run the validations to be absolutely sure.
Any idea why this might be the case?
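To make the pattern concrete, the per-metric gaps between the rows in the table can be tabulated directly (the values are copied from the table above; this is plain arithmetic, not an MMF call):

```python
# Per-metric gaps between observed results and the two papers' reported numbers.
# All values copied verbatim from the results table above.
results = {
    "my trained model": {"acc": 49.60, "auroc": 53.17},
    "zoo model":        {"acc": 51.40, "auroc": 57.21},
    "arxiv paper":      {"acc": 52.73, "auroc": 58.79},
    "neurips phase 2":  {"acc": 50.67, "auroc": 52.33},
}

def gap(a, b):
    """Absolute per-metric difference between two rows of the table."""
    return {k: round(abs(results[a][k] - results[b][k]), 2) for k in ("acc", "auroc")}

print("zoo vs arxiv:", gap("zoo model", "arxiv paper"))
print("trained vs neurips:", gap("my trained model", "neurips phase 2"))
```

The zoo model sits within about 1.6 points of the arXiv numbers, and the self-trained model within about 1.1 points of the NeurIPS phase 2 numbers, which is what motivates the question.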
Hi @hivestrung,
I can try running the exact command on my side to see if I can replicate the result, though I would actually expect this to fall within the normal run-to-run range. Have you tried running the command multiple times and averaging?
@hivestrung The difference arises because the new train set (released in phase 2) is also different: it was reannotated to fix bad examples. You won't be able to replicate the exact results in the arXiv version. I would suggest using phase 2 and the baselines in the NeurIPS paper. We will try to open up Phase 2 submissions soon and update the arXiv.
I previously had trouble reproducing the results using the pretrained models from the model zoo; that is now resolved. I have since moved on to training the model myself, and I encounter problems reproducing the results with MMF.
Instructions To Reproduce the Issue:
1. Train Image-grid
2. Evaluate on the validation set
Full logs observed:
When training, I get these warnings and a "targets not found" error.
After training, the model seems to have been saved anyway, so I carry on and use it to evaluate on the validation set.
Expected behavior:
I expect to get accuracy 52.73 and AUROC 58.79.
Instead I get accuracy 52.00 and AUROC 52.67.
The AUROC in particular is very different.
I have checked, and I am evaluating on the dev_seen set: I changed the dataset in the YAML config file at /home/username/.local/lib/python3.8/site-packages/mmf/configs/datasets/hateful_memes/defaults.yaml
I also get a closer, satisfactory reproduction of results when evaluating with the pre-trained model, so there must be something wrong with the training process (see the errors I encountered).
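As a side note on why AUROC can move so much more than accuracy: AUROC scores the ranking of the predicted probabilities, while accuracy thresholds them at 0.5. A toy sketch with made-up numbers (a hand-rolled pairwise AUROC, not MMF's metric implementation):

```python
# Toy illustration (assumed data, not from MMF): accuracy and AUROC can
# diverge sharply, because AUROC scores the ranking of probabilities
# while accuracy scores hard predictions at a fixed 0.5 threshold.

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count half): the pairwise definition of AUROC."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.6, 0.7, 0.8, 0.9]            # perfect ranking, badly calibrated
preds = [int(s >= 0.5) for s in scores]  # every example predicted positive

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy, auroc(labels, scores))   # prints 0.5 1.0
```

Here every score lands above the 0.5 threshold, so accuracy collapses to chance even though the ranking is perfect; two checkpoints can therefore match on accuracy while differing widely on AUROC.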
Environment: