clip-vil / CLIP-ViL

[ICLR 2022] code for "How Much Can CLIP Benefit Vision-and-Language Tasks?" https://arxiv.org/abs/2107.06383

Pythia Feature Extraction #28

Closed · shamanthak-hegde closed this 2 years ago

shamanthak-hegde commented 2 years ago

I'm trying to extract image features for VQA with Pythia using `python pythia_clip_grid_feature.py --config-file configs/R-50-grid.yaml --dataset coco_2015_train --model_type RN50`. Isn't this supposed to output 100 object features of dimension 2048? Instead I'm getting outputs of varying dimensions, e.g. (1, 13, 20, 2048) and (1, 15, 20, 2048). Could anyone point out where I'm going wrong and what I need to change to get a (100, 2048) output per image? Also, what should the format of the annotation file be if I use the VQA dataset, since it doesn't have attributes like area, segmentation, categories, etc.?

Thanks

sIncerass commented 2 years ago

Hi there,

Thanks for the interest. I think you have to specify a transform that resizes every image to the same dimensions, e.g. 600 × 1000; then the resulting feature shapes will all be the same.
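If it helps, here is a minimal sketch of that idea in PyTorch. The (640, 1024) target size, the stride-32 assumption, and the `feats` placeholder are illustrative choices of mine, not values from this repo's configs:

```python
import torch
from torchvision import transforms

# Resize every image to one fixed size so the feature grid is always the
# same shape. 640 x 1024 is an illustrative choice (multiples of 32); a
# stride-32 ResNet backbone then yields a 20 x 32 grid of 2048-d features.
fixed_resize = transforms.Compose([
    transforms.Resize((640, 1024)),
    transforms.ToTensor(),
])

# Placeholder standing in for the extractor's output, shape (1, H/32, W/32, 2048).
feats = torch.randn(1, 20, 32, 2048)

# One 2048-d vector per grid cell: (1, 20, 32, 2048) -> (640, 2048).
flat = feats.flatten(1, 2).squeeze(0)

# If exactly 100 features are required, adaptive average pooling (my
# suggestion, not something stated in this thread) can force a 10 x 10 grid:
pooled = torch.nn.functional.adaptive_avg_pool2d(
    feats.permute(0, 3, 1, 2), (10, 10))                # (1, 2048, 10, 10)
hundred = pooled.flatten(2).transpose(1, 2).squeeze(0)  # (100, 2048)
print(flat.shape, hundred.shape)
```

Note that grid features are not 100 object regions; the feature count is just the number of grid cells, so it only becomes fixed once the input size is fixed.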

The VQA annotation file looks like this JSON file (link), which is constructed from Image, Question, and Answer entries.
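As a rough illustration of the shape of such a file (the field names below are hypothetical, not copied from the linked JSON; check the actual file for the exact schema), something like this should be enough, with no area/segmentation/category fields:

```python
import json

# Hypothetical VQA-style annotation entries: each record ties an image to a
# question and its answer. Field names and values are illustrative only.
annotations = [
    {
        "image_id": 458752,
        "image": "COCO_train2014_000000458752.jpg",
        "question": "What is this photo taken looking through?",
        "answer": "net",
    },
]

with open("vqa_annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)
```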

Thanks,