facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Hateful memes baselines don't seem to be predicting correctly #288

Closed · josephch405 closed 4 years ago

josephch405 commented 4 years ago

According to the docs under the Hateful Memes directory, I should be able to run

mmf_predict config=<REPLACE_WITH_BASELINE_CONFIG> \
  model=<REPLACE_WITH_MODEL_KEY> \
  dataset=hateful_memes \
  run_type=test \
  checkpoint.resume_zoo=<REPLACE_WITH_PRETRAINED_ZOO_KEY>

and it should output a reasonably performant CSV for submission. Specifically, we are running the VisualBERT baseline with:

mmf_predict config=projects/hateful_memes/configs/visual_bert/defaults.yaml \
  model=visual_bert \
  dataset=hateful_memes \
  run_type=test \
  checkpoint.resume_zoo=visual_bert.finetuned.hateful_memes.from_coco

We chose defaults.yaml since it seems that running with the from_pretrained flag via from_coco.yaml was only meant for training (inference with that config gave variable predictions).

Running the mmf_run variant of the above command on validation gives a good AUROC (~0.73). However, when we submit the test CSV we have been getting AUROC scores on the order of ~0.3, which seems rather odd. Is this the intended behavior? Are we not using the right configs here? We have also tried training our own models with from_coco.yaml as a starting point, but we likewise see low test AUROC despite high validation scores. We strongly suspect something is going wrong in the inference flow, but on inspection nothing seems clearly incorrect.
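For reference, the validation run we used was essentially the mmf_predict command above swapped to mmf_run with run_type=val (sketched here assuming the same config and checkpoint):

mmf_run config=projects/hateful_memes/configs/visual_bert/defaults.yaml \
  model=visual_bert \
  dataset=hateful_memes \
  run_type=val \
  checkpoint.resume_zoo=visual_bert.finetuned.hateful_memes.from_coco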

vedanuj commented 4 years ago

Thanks @josephch405 for raising this issue. This should be fixed now. Please install mmf from the latest master and try it out.
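For reference, one common way to pick up the latest master is a direct pip install from GitHub (assuming a standard pip setup):

pip install --upgrade git+https://github.com/facebookresearch/mmf.git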