I believe the numbers reported in the FB paper are the baselines run on the phase 1 dataset (dev_seen/test_seen), while yours must have been run on the phase 2 dataset (dev_unseen/test_unseen).
You can change the dataset in the YAML config file. Mine was located at /home/username/.local/lib/python3.8/site-packages/mmf/configs/datasets/hateful_memes/defaults.yaml, where (around lines 22-26) you can change
```yaml
annotations:
  ...
  val:
  - hateful_memes/defaults/annotations/dev_unseen.jsonl
```
to:
```yaml
annotations:
  ...
  val:
  - hateful_memes/defaults/annotations/dev_seen.jsonl
```
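If MMF is installed somewhere else on your machine, a quick way to find the packaged configs is to print the install location of the package (a minimal sketch, assuming a standard pip install of mmf):

```bash
# Print the directory of the installed mmf package; the dataset config then lives at
# <printed_dir>/configs/datasets/hateful_memes/defaults.yaml
python -c "import mmf, os; print(os.path.dirname(mmf.__file__))"
```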
Wow, thank you so much, I never thought to look at this. I was thinking the problem might have been the seed or something. I am getting much closer numbers now, though still not the same.
For example, Text BERT:
val/hateful_memes/cross_entropy: 0.7298, val/total_loss: 0.7298, val/hateful_memes/accuracy: 0.5880, val/hateful_memes/binary_f1: 0.4798, val/hateful_memes/roc_auc: 0.6528
The accuracy reported in the paper is 58.26 and the AUROC is 64.65; mine are 58.80 and 65.28 respectively. Those are close, but is there something else affecting the score?
You're welcome. I get similar yet slightly different numbers as well. My Image-Grid validation results are actually way off compared to what the paper reports; the other baselines I was able to run are much closer.
I have to train mine with a much lower batch size because my machine can't handle a large one. I suspect this is why we get different accuracy and AUROC. It is still close enough, in my opinion.
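For what it's worth, you shouldn't have to edit configs to experiment with this: MMF accepts command-line overrides, so the batch size (and, if your version supports it, gradient accumulation) can be set per run. A hedged sketch, using the Text BERT model key and a placeholder config path; adjust it to whichever baseline command you are running:

```bash
# Append overrides to the usual baseline command; training.batch_size sets the
# per-step batch, and training.update_frequency (if available in your MMF version)
# accumulates gradients so a small batch behaves more like the larger one in the paper.
mmf_run config=<path/to/your/baseline/config.yaml> \
    model=unimodal_text dataset=hateful_memes \
    training.batch_size=32 \
    training.update_frequency=2
```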
Out of curiosity, were you able to train the Image-Region, MMBT-Region, and the ViLBERT/Visual BERT ones? I can't. I reported my issue in a previous thread and still haven't been able to fix the problem.
Ahhh interesting. That's reassuring. How does one define "close enough" though? +/- 1?
I am trying to train now, but my task requires reimplementing things. To start, I was using the pretrained models to see whether I could reproduce the numbers as a sanity check before getting on with the other work.
I am able to "train" Image-Grid, but my machine is too slow, so I terminated the run. It was making its way through the epochs, though, so I think it was fine. I haven't encountered your issue, so I can't help at the moment, but I will keep it in mind.
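For reference, my understanding is that the released checkpoints can be evaluated on the validation set directly from the MMF model zoo with a val-only run along these lines. The config path and zoo key below are my best guess for Text BERT, so double-check the MMF hateful memes docs for the exact names in your version:

```bash
# Validation-only run that restores the pretrained Text BERT baseline from the zoo.
# run_type=val skips training; checkpoint.resume_zoo selects the released checkpoint.
mmf_run config=projects/hateful_memes/configs/unimodal/bert.yaml \
    model=unimodal_text dataset=hateful_memes \
    run_type=val \
    checkpoint.resume_zoo=unimodal_text.hateful_memes.bert \
    checkpoint.resume_pretrained=False
```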
@kLabille Thanks for helping out @hivestrung with the issue.
@hivestrung You can't replicate the exact numbers in the paper because those are the averages over multiple runs with different seeds. That's why you will see a minor difference in the metrics but they should be in the ballpark.
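If you want to see how much of the remaining gap is just seed noise, you can pin or sweep the seed and compare runs; a minimal sketch, assuming the training.seed option and a placeholder config path:

```bash
# Repeat the same run with a few fixed seeds; the paper numbers are averages over
# runs like these, so the spread shows how much variance to expect.
for seed in 1 2 3; do
    mmf_run config=<path/to/your/baseline/config.yaml> \
        model=unimodal_text dataset=hateful_memes \
        training.seed=$seed
done
```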
@apsdehal Ahhh okay, thank you. Can I also check one thing? It's not super clear in the paper (or maybe I am new to this and don't know the usual practice), but for the pre-trained models evaluated on the validation set, were they fine-tuned on the training data or not at all? Thank you very much!
I have an issue reproducing the baselines in the Hateful Memes paper. Specifically, I am trying to reproduce the baseline numbers for Text BERT, but I am also unable to reproduce the baselines for Image-Grid.
Instructions To Reproduce the Issue:
I ran this exact command for Text BERT (I ran it twice and got the same result both times):
And this command for Image-Grid:
Log of the Text BERT validation evaluation I observed:
Expected behavior:
For Text BERT, I expected the validation accuracy to be 58.26% and the AUROC to be 64.65%, but I seem to have gotten 61.67% and 61.19% respectively. A similar discrepancy occurs for Image-Grid:
Environment:
Provide your environment information using the following command: