facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

How to get image/text tensor? #1140

Open · ayulockin opened 2 years ago

ayulockin commented 2 years ago

❓ Questions and Help

Apologies if this is already covered somewhere or I am missing something, but I am unsure how and where the actual image and text tensors are given to the model as input.

Context

I am trying to build a model prediction visualizer using W&B Tables. I have noticed in the default.yaml file that if I set evaluation.predict=True, MMF writes a .json/.csv file with question_id, image_id, and the predicted answer. This is great, but it could be made much more useful if we could look at the data and model predictions interactively.
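For reference, the relevant switches sit under the evaluation key of the config; roughly like this (paraphrased from my reading of default.yaml, so please double-check the field names against your checkout):

```yaml
evaluation:
  metrics: []
  # when true, the prediction loop writes a predictions file instead of computing metrics
  predict: true
  # output format of the predictions file
  predict_file_format: json  # or csv
```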

What am I trying to build?

The screenshot below shows an example where W&B Tables is used to visualize YOLOv5 model predictions on the COCO dataset.

[screenshot: W&B Tables visualizing YOLOv5 predictions on COCO]

I am trying to build something similar, and here's a screenshot of my barebones Table:

[screenshot: barebones W&B Table logging only question_id and image_id]

I started building this on top of TestReporter inside the test_reporter.py file.

Where am I stuck?

As you can see, I am only logging the question_id and image_id, but not the actual question string or image. Setting evaluation.predict=True calls prediction_loop inside the evaluation_loop.py file. When I inspect the prepared_batch, it is a SampleList that contains question_id and image_id besides other features/info.

How can I parse these ids to get the actual tensors? And how should I recover the original text from the tokenized text inside this SampleList?

Basically, I want to understand the MMF way of parsing each data sample so that I can log it to build the Table.
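For context, here is roughly how I am probing the batch; a hypothetical snippet placed inside prediction_loop (the reporter.prepare_batch call mirrors what I see in the code, but treat the details as assumptions):

```python
# Inside prediction_loop (evaluation_loop.py): SampleList subclasses
# OrderedDict, so .keys() lists the fields the dataset packed in.
for batch in dataloader:
    prepared_batch = reporter.prepare_batch(batch)
    print(list(prepared_batch.keys()))  # e.g. ['question_id', 'image_id', 'input_ids', ...]
    print(prepared_batch.question_id)   # tensor of ids, not the raw question text
    break
```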

PS: I have gone through the available documentation and done my own digging through the codebase, but I feel lost. I would appreciate any feedback or direction on how to approach this. If something is not clear, please ask away. :)

TownWilliam commented 2 years ago

Hello. The image inputs to the model are usually pre-extracted feature tensors, not raw image tensors. During training or testing, the model never sees the image itself. The image files can be downloaded from the dataset's website.

In the past, I wrote a simple lookup script to find the image file corresponding to each image_id. It is a little tedious, but it is one way to do it; the core of it is sketched below.
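A minimal sketch, assuming COCO-style file naming as used by VQA2 (the image_path_from_id helper is hypothetical, and other datasets name their files differently):

```python
import os

def image_path_from_id(image_dir: str, image_id: int, split: str = "val2014") -> str:
    # COCO images are named like COCO_val2014_000000262148.jpg:
    # the numeric id is zero-padded to 12 digits.
    return os.path.join(image_dir, f"COCO_{split}_{int(image_id):012d}.jpg")

# Example:
# image_path_from_id("/data/coco/val2014", 262148)
# -> "/data/coco/val2014/COCO_val2014_000000262148.jpg"
```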

For the texts inside the SampleList, they can be converted with the functions object_to_byte_tensor() and byte_tensor_to_object() in mmf/utils/distributed.py: https://github.com/facebookresearch/mmf/blob/b672a745996eb0549a0b903a30a225a8f0668182/mmf/utils/distributed.py#L244

https://github.com/facebookresearch/mmf/blob/b672a745996eb0549a0b903a30a225a8f0668182/mmf/utils/distributed.py#L264
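A minimal round-trip sketch of what these two functions do, based on my reading of the code (they pickle the object into a fixed-size byte tensor and back):

```python
from mmf.utils.distributed import object_to_byte_tensor, byte_tensor_to_object

# Pack an arbitrary Python object into a fixed-size byte tensor (so it can
# ride along inside a SampleList across processes), then decode it back.
question = "What color is the bus?"
packed = object_to_byte_tensor(question)
assert byte_tensor_to_object(packed) == question
```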

These functions are also used in the prediction code for converting the text: https://github.com/facebookresearch/mmf/blob/b672a745996eb0549a0b903a30a225a8f0668182/mmf/datasets/builders/textvqa/dataset.py#L42-L55
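As a rough sketch of how you might apply byte_tensor_to_object to a report/SampleList (the context_tokens field name is taken from the TextVQA dataset linked above; other datasets may pack different fields, or none at all):

```python
from mmf.utils.distributed import byte_tensor_to_object

def recover_context_tokens(report):
    # Each row of report.context_tokens is a byte tensor created by the
    # dataset with object_to_byte_tensor(); decode each one back into the
    # original Python object (here, a list of OCR/context tokens).
    return [
        byte_tensor_to_object(report.context_tokens[idx])
        for idx in range(report.question_id.size(0))
    ]
```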

Perhaps you could use these functions on the SampleList to get the converted results and record them in your file. However, I do not know much more about these two functions; that is all I know.
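For the W&B side, a hedged sketch of how the recovered pieces could then be logged as a Table; the predictions rows, IMAGE_DIR, and the image_path_from_id helper here are assumptions carried over from the sketches above, not MMF API:

```python
import wandb

run = wandb.init(project="mmf-predictions")
table = wandb.Table(columns=["question_id", "image", "question", "answer"])

for row in predictions:  # rows you collect in your TestReporter-based logger
    table.add_data(
        row["question_id"],
        wandb.Image(image_path_from_id(IMAGE_DIR, row["image_id"])),
        row["question"],  # e.g. recovered via byte_tensor_to_object
        row["answer"],
    )

wandb.log({"predictions": table})
run.finish()
```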