Simple example of getting image caption prediction

facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

https://mmf.sh/


Simple example of getting image caption prediction #364

Closed greeneggsandyaml closed 1 year ago

greeneggsandyaml commented 4 years ago

Hello MMF authors,

Thank you for your nice repo. I'm new to the repo and I'm literally looking for the simplest thing: I'd like to run masked language modeling inference with one of your pretrained masked captioning models on a new image. This should be super super simple, but I'm not seeing how to do it.

For example, I'd like to write a simple helper function that takes a PIL image and a caption string like "A train leaving a [MASK]" and returns the prediction from VilBERT/VisualBERT. To do this, I think I need to extract features from the image in the same way that you extracted them for COCO/CC pretraining. Do you provide the feature extraction code (I probably just missed it)? Once I've extracted the features, how exactly should I preprocess the text data and input all the data to the model?

Note: I see that the MMBT model has a helpful predict interface, but the other models do not (I think?)

Thank you for all your help and your great work!

vedanuj commented 4 years ago

Supporting predict for other models like VilBERT/VisualBERT is not on our immediate roadmap. However, we encourage you to submit a PR for this, and we can help.

greeneggsandyaml commented 4 years ago

Thanks for the reply! This is something I can work on in the next few weeks. To clarify, how exactly should I extract the features for VilBERT/VisualBERT? It's not clear to me what exact pretrained network was used and how the features were extracted. Thanks!

apsdehal commented 4 years ago

For starters, how about creating a Colab demo for these models? Here are some pointers:

The script used to extract the features for VisualBERT/ViLBERT is at https://github.com/facebookresearch/mmf/blob/master/tools/scripts/features/extract_features_vmb.py.
The original demo from Pythia, which had a class for extracting these features, is available at https://colab.research.google.com/drive/1Z9fsh10rFtgWe4uy8nvU4mQmqdokdIRR?usp=sharing. You can take bits and pieces from this code to create your own demo.
The build_processors method can build the processors you need for processing the text. Have a look at MMBT's HM Inference example to see how it is used.
For pretrained networks, take a look at what is available in the model zoo for visualbert and vilbert: https://github.com/facebookresearch/mmf/blob/master/mmf/configs/zoo/models.yaml
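
As a rough sketch of how the pieces above might fit together in the helper described earlier in the thread: the function below takes the three steps (feature extraction, text processing, model forward pass) as plain callables. Every name here (predict_masked_caption, extract_features, process_text, run_model, and the stub functions) is a hypothetical placeholder rather than an MMF API, and the stubs return made-up scores purely to show the data flow. It also assumes the text processor returns a dict with a "tokens" list and that the model returns one row of vocabulary scores per token, which may not match the real models.

```python
from typing import Any, Callable, Dict, List, Sequence

MASK_TOKEN = "[MASK]"


def predict_masked_caption(
    image: Any,
    caption: str,
    extract_features: Callable[[Any], Any],
    process_text: Callable[[str], Dict[str, Any]],
    run_model: Callable[[Any, Dict[str, Any]], Sequence[Sequence[float]]],
    vocab: Sequence[str],
    top_k: int = 5,
) -> List[str]:
    """Return the top_k vocabulary entries for the first [MASK] in the caption.

    All three callables are hypothetical stand-ins: in a real demo they would
    wrap the feature extraction script, the text processor, and a pretrained
    model from the zoo.
    """
    if MASK_TOKEN not in caption:
        raise ValueError(f"caption must contain {MASK_TOKEN}")

    features = extract_features(image)   # region features for the image
    text = process_text(caption)         # assumed to contain a "tokens" list
    mask_index = text["tokens"].index(MASK_TOKEN)

    scores = run_model(features, text)   # assumed: one score row per token
    row = scores[mask_index]
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [vocab[i] for i in ranked[:top_k]]


# Stubs with made-up values, only to exercise the data flow.
def stub_extract_features(image: Any) -> List[float]:
    return [0.0]


def stub_process_text(caption: str) -> Dict[str, Any]:
    return {"tokens": caption.split()}


def stub_run_model(features: Any, text: Dict[str, Any]) -> List[List[float]]:
    rows = [[0.0, 0.0, 0.0] for _ in text["tokens"]]
    rows[text["tokens"].index(MASK_TOKEN)] = [0.7, 0.1, 0.2]
    return rows


stub_vocab = ["station", "dog", "platform"]
predictions = predict_masked_caption(
    image=None,
    caption="A train leaving a [MASK]",
    extract_features=stub_extract_features,
    process_text=stub_process_text,
    run_model=stub_run_model,
    vocab=stub_vocab,
    top_k=2,
)
print(predictions)
```

Passing the steps in as callables keeps the helper independent of the feature extractor's setup, so the stubs can be swapped for real components one at a time.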

Let us know if something isn't clear or if you need more help. :) Looking forward to your contribution.

gchhablani commented 4 years ago

@apsdehal @vedanuj I was going through the code at https://github.com/facebookresearch/mmf/blob/master/tools/scripts/features/extract_features_vmb.py and wanted to understand why it is kept separately under tools instead of being part of the mmf framework. Is there a reason behind this?

apsdehal commented 4 years ago

@GunjanChhablani The code to extract features depends on maskrcnn benchmark, which requires a specific setup and which we don't want to include in our main dependencies yet. That's why it is kept separately in tools.

gchhablani commented 4 years ago

Hi @apsdehal, Thank you so much for replying.

parthsuresh commented 4 years ago

Hi @vedanuj, @apsdehal can I work on this?

Ganeshgarladinne commented 3 years ago

I'm a beginner. Can you help, please?