facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

[Hateful Memes] Features I extract aren't the same as detectron.lmdb #895

Closed shivgodhia closed 3 years ago

shivgodhia commented 3 years ago

🐛 Bug

extract_features_frcnn.py might not be extracting the same features as those used for the original dataset

To Reproduce

It's tricky to do so, but basically I created a Predictor class that loads the model, takes in an image (the path to the PNG file) and text, applies the required transforms to the data, builds a sample list, and runs it through the model.

Using this, I ran the model on all the images in the validation set and computed the statistics. I also did the same using mmf_run to see what the official implementation of the model would get.

This worked perfectly (identical accuracy and roc_auc scores down to the fourth decimal place) for Image-Grid, Text BERT, Concat BERT and Late Fusion. It did not work for Visual BERT, and when I tried it for Image-Region (which uses features) it also did not work.

Thus I conclude that there is a problem somewhere related to feature extraction. It could be that I'm not constructing the sample list with the features correctly, or that the features themselves are very different and not usable in the model.
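For reference, this is roughly how I build the sample list for the feature-based models. It is only a simplified sketch: the image_feature_0 key and the way I call the text processor are things I inferred from reading the dataset code, so they may well be where the problem lies.

import numpy as np
import torch
from mmf.common.sample import Sample, SampleList

def build_sample_list(feature_path, text, text_processor):
    sample = Sample()

    # text_processor is the dataset's BERT tokenizer processor; I assume it can be
    # called with a dict and returns the token ids/masks that the model expects.
    processed = text_processor({"text": text})
    sample.update(processed)

    # Features extracted offline, e.g. a (num_boxes, 2048) float32 array
    # saved by the extraction script.
    sample.image_feature_0 = torch.from_numpy(np.load(feature_path))

    return SampleList([sample])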

Code

This is for Image-Region:

def test_valSet():
    target_metrics = "val/hateful_memes/accuracy: 0.5759, val/hateful_memes/binary_f1: 0.1358, val/hateful_memes/roc_auc: 0.4790"
    VAL_SET_PATH = os.path.join(HATEFUL_MEMES_PATH, "dev_seen.jsonl")

    test_predictor = BasePredictorTest("Image-Region", HATEFUL_MEMES_PATH, VAL_SET_PATH)
    _, meter = test_predictor.evaluate_dataset()

    metrics = ", ".join(x for x in str(meter).split(", ")[1:])
    print(metrics)
    # run with Junqq/BrettAllen feature extraction: val/hateful_memes/accuracy: 0.4860, val/hateful_memes/binary_f1: 0.5499, val/hateful_memes/roc_auc: 0.5198
    assert metrics == target_metrics

test_valSet()
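As an aside, exact string equality is a fairly brittle way to compare the two metric lines; a slightly more forgiving sketch would be to parse the numbers out and allow a small tolerance:

def parse_metrics(metrics_str):
    # "val/hateful_memes/accuracy: 0.5759, ..." -> {"val/hateful_memes/accuracy": 0.5759, ...}
    return {
        key.strip(): float(value)
        for key, value in (item.split(":") for item in metrics_str.split(","))
    }

def metrics_close(a, b, tol=1e-4):
    parsed_a, parsed_b = parse_metrics(a), parse_metrics(b)
    return parsed_a.keys() == parsed_b.keys() and all(
        abs(parsed_a[k] - parsed_b[k]) <= tol for k in parsed_a
    )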

The same approach was used for Visual BERT.

Expected behavior

The same, or at least very similar, accuracy and roc_auc scores on the validation set are expected.

Image-Region ground truth: val/hateful_memes/accuracy: 0.5759, val/hateful_memes/binary_f1: 0.1358, val/hateful_memes/roc_auc: 0.4790
Image-Region what I got: val/hateful_memes/accuracy: 0.4860, val/hateful_memes/binary_f1: 0.5499, val/hateful_memes/roc_auc: 0.5198

Note the wildly different scores for Visual BERT COCO:

Visual BERT COCO ground truth: val/hateful_memes/accuracy: 0.6840, val/hateful_memes/binary_f1: 0.6010, val/hateful_memes/roc_auc: 0.7559
Visual BERT COCO what I got: val/hateful_memes/accuracy: 0.5540, val/hateful_memes/binary_f1: 0.3989, val/hateful_memes/roc_auc: 0.6127

Environment

You can run the script with:

# For security purposes, please check the contents of collect_env.py before running it.
python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 11.1 (x86_64)
GCC version: Could not collect
Clang version: 12.0.0 (clang-1200.0.32.28)
CMake version: version 3.19.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] pytorch-lightning==1.2.7
[pip3] torch==1.8.1
[pip3] torchmetrics==0.3.0
[pip3] torchtext==0.5.0
[pip3] torchvision==0.9.1
[conda] Could not collect

shivgodhia commented 3 years ago

@vedanuj Sorry to bother you, I noticed this repository is yours: https://gitlab.com/vedanuj/vqa-maskrcnn-benchmark

I have tried extracting features using extract_features_vmb.py instead. I then used lmdb_conversion to extract the detectron.lmdb features that are automatically downloaded from the fb servers (they are stored in /home/sgg29/.cache/torch/mmf/data/datasets/hateful_memes/defaults/features/detectron.lmdb). Call these the ground-truth features.
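Reading the features directly out of detectron.lmdb in Python should also work; this is a rough sketch based on my reading of the lmdb_conversion script (the exact fields stored in each record are an assumption on my part):

import lmdb
import pickle

def load_lmdb_features(lmdb_path):
    env = lmdb.open(lmdb_path, readonly=True, lock=False,
                    readahead=False, meminit=False)
    features = {}
    with env.begin(write=False) as txn:
        # MMF feature LMDBs keep the list of record keys under a special "keys" entry.
        keys = pickle.loads(txn.get(b"keys"))
        for key in keys:
            record = pickle.loads(txn.get(key))
            # Each record is a dict; "features" should hold the (num_boxes, 2048) array.
            features[key.decode()] = record["features"]
    return features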

Then I loaded both sets of features with numpy and compared those extracted using extract_features_vmb.py against the ground-truth features from detectron.lmdb (previous paragraph).

The shape is finally (100, 2048) for both, so I'm on the right track (Brett's feature extractor produced features with shape (36, 2048)). But the loaded numpy arrays are not the same for the same image, so the features are still different (not sure how different, but they're different).
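This is how I compare the two sets of features for a single image (the file paths below are just placeholders for wherever the two dumps ended up):

import numpy as np

mine = np.load("my_features/12345.npy")        # from extract_features_vmb.py
theirs = np.load("detectron_dump/12345.npy")   # dumped from detectron.lmdb

print(mine.shape, theirs.shape)                # (100, 2048) for both
print(np.allclose(mine, theirs, atol=1e-4))    # False for me

# The ordering of the 100 regions isn't guaranteed to match, so I also check
# whether each of my rows has a near-identical row somewhere in theirs.
sims = (mine @ theirs.T) / (
    np.linalg.norm(mine, axis=1, keepdims=True)
    * np.linalg.norm(theirs, axis=1, keepdims=True).T
)
print(sims.max(axis=1))                        # values near 1.0 would mean the same regions were found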

Can you tell me how exactly detectron.lmdb was created?

shivgodhia commented 3 years ago

Referring to the hateful memes paper:

We evaluate two image encoders: 1) standard ResNet-152 [30] convolutional features from res-5c with average pooling (Image-Grid) 2) features from fc6 layer of Faster-RCNN [60] with ResNeXt-152 as its backbone [86]. The Faster-RCNN is trained on Visual Genome [43] with attribute loss following [69] and features from fc6 layer are fine-tuned using weights of the fc7 layer (Image-Region). For the textual modality, the unimodal model is BERT [14] (Text BERT).

I don't get the last bit: "and features from fc6 layer are fine-tuned using weights of the fc7 layer". I think this is what I'm missing. How do I do that?
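My current guess is that it means the extracted fc6 features are passed through a linear layer initialised from the detector's fc7 weights, which then trains along with the rest of the model. I have not verified this against the MMF code, and the weight file names below are placeholders:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder files holding the detector's fc7 weights
# (shapes assumed: (2048, 2048) for the weight matrix, (2048,) for the bias).
fc7_w = torch.from_numpy(np.load("fc7_w.npy")).float()
fc7_b = torch.from_numpy(np.load("fc7_b.npy")).float()

fc7 = nn.Linear(2048, 2048)
with torch.no_grad():
    fc7.weight.copy_(fc7_w)
    fc7.bias.copy_(fc7_b)

fc6_features = torch.randn(100, 2048)       # stand-in for the extracted fc6 features
fc7_features = F.relu(fc7(fc6_features))    # what the downstream model would consume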