I believe the numbers reported in the FB paper are the baselines run on the phase 1 dataset (dev_seen/test_seen), while yours must have been run on the phase 2 dataset (dev_unseen/test_unseen).
You can change the dataset in the YAML config file. Mine was located at /home/username/.local/lib/python3.8/site-packages/mmf/configs/datasets/hateful_memes/defaults.yaml, where (around lines 22-26) you can change
```yaml
annotations:
  ...
  val:
  - hateful_memes/defaults/annotations/dev_unseen.jsonl
```
to:
```yaml
annotations:
  ...
  val:
  - hateful_memes/defaults/annotations/dev_seen.jsonl
```
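If MMF is installed somewhere else on your machine, a quick way to find the packaged configs is to print the install location of the package (a minimal sketch, assuming a standard pip install of mmf):

```bash
# Print the directory of the installed mmf package; the dataset config then lives at
# <printed_dir>/configs/datasets/hateful_memes/defaults.yaml
python -c "import mmf, os; print(os.path.dirname(mmf.__file__))"
```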
Wow, thank you so much, I never thought to look at this. I was thinking the problem might have been the seed or something. I am getting much closer numbers now, though still not the same.
For example, Text BERT:
val/hateful_memes/cross_entropy: 0.7298, val/total_loss: 0.7298, val/hateful_memes/accuracy: 0.5880, val/hateful_memes/binary_f1: 0.4798, val/hateful_memes/roc_auc: 0.6528
The accuracy reported in the paper is 58.26 and the AUROC is 64.65; mine are 58.80 and 65.28 respectively. Those are close, but is there something else affecting the score?
You're welcome. I get similar yet slightly different numbers as well. My Image-Grid validation results are actually way off compared to what the paper reports; the other baselines I was able to run are much closer.
I have to train mine with a much lower batch size because my machine can't handle a large one. I suspect this is why we get different accuracy and AUROC. It is still close enough, in my opinion.
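For what it's worth, you shouldn't have to edit configs to experiment with this: MMF accepts command-line overrides, so the batch size (and, if your version supports it, gradient accumulation) can be set per run. A hedged sketch, using the Text BERT model key and a placeholder config path; adjust it to whichever baseline command you are running:

```bash
# Append overrides to the usual baseline command; training.batch_size sets the
# per-step batch, and training.update_frequency (if available in your MMF version)
# accumulates gradients so a small batch behaves more like the larger one in the paper.
mmf_run config=<path/to/your/baseline/config.yaml> \
    model=unimodal_text dataset=hateful_memes \
    training.batch_size=32 \
    training.update_frequency=2
```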
Out of curiosity, were you able to train the Image-Region, MMBT-Region, and the ViLBERT/Visual BERT ones? I can't. I reported my issue in a previous thread and still haven't been able to fix the problem.
Ahhh interesting. That's reassuring. How does one define "close enough" though? +/- 1?
I am trying to train now, but my task requires reimplementing things. To start, I was using the pretrained models to see whether I could reproduce the numbers as a sanity check before getting on with the other work.
I am able to "train" Image-Grid, but my machine is too slow, so I terminated the run. It was making its way through the epochs, though, so I think it was fine. I haven't encountered your issue, so I can't help at the moment, but I will keep it in mind.
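For reference, my understanding is that the released checkpoints can be evaluated on the validation set directly from the MMF model zoo with a val-only run along these lines. The config path and zoo key below are my best guess for Text BERT, so double-check the MMF hateful memes docs for the exact names in your version:

```bash
# Validation-only run that restores the pretrained Text BERT baseline from the zoo.
# run_type=val skips training; checkpoint.resume_zoo selects the released checkpoint.
mmf_run config=projects/hateful_memes/configs/unimodal/bert.yaml \
    model=unimodal_text dataset=hateful_memes \
    run_type=val \
    checkpoint.resume_zoo=unimodal_text.hateful_memes.bert \
    checkpoint.resume_pretrained=False
```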
@kLabille Thanks for helping out @hivestrung with the issue.
@hivestrung You can't replicate the exact numbers in the paper because those are the averages over multiple runs with different seeds. That's why you will see a minor difference in the metrics but they should be in the ballpark.
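If you want to see how much of the remaining gap is just seed noise, you can pin or sweep the seed and compare runs; a minimal sketch, assuming the training.seed option and a placeholder config path:

```bash
# Repeat the same run with a few fixed seeds; the paper numbers are averages over
# runs like these, so the spread shows how much variance to expect.
for seed in 1 2 3; do
    mmf_run config=<path/to/your/baseline/config.yaml> \
        model=unimodal_text dataset=hateful_memes \
        training.seed=$seed
done
```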
@apsdehal Ahhh okay, thank you. Can I also check one thing? It's not super clear in the paper (or maybe I am new to this and don't know the usual practice), but for the pre-trained models evaluated on the validation set, were they fine-tuned on the training data or not at all? Thank you very much!
I have an issue reproducing the baselines in the Hateful Memes paper. Specifically, I am trying to reproduce the baseline numbers for Text BERT, but I am also unable to reproduce the baselines for Image-Grid.
Instructions To Reproduce the Issue:
I ran this exact command for Text BERT (I ran it twice and got the same result both times):
And this command for Image-Grid:
Log of the Text BERT validation evaluation I observed:
Expected behavior:
For Text BERT, I expected the validation accuracy to be 58.26% and the AUROC to be 64.65%, but I seem to have gotten 61.67% and 61.19% respectively. A similar discrepancy occurs for Image-Grid:
Environment:
Provide your environment information using the following command: