clip-vil / CLIP-ViL

[ICLR 2022] code for "How Much Can CLIP Benefit Vision-and-Language Tasks?" https://arxiv.org/abs/2107.06383

Pythia Feature Extraction #28

Closed · shamanthak-hegde closed this 2 years ago

shamanthak-hegde commented 2 years ago

I'm trying to extract image features for VQA with Pythia using `python pythia_clip_grid_feature.py --config-file configs/R-50-grid.yaml --dataset coco_2015_train --model_type RN50`. Isn't this supposed to output 100 object features of dimension 2048? Instead I'm getting outputs of varying dimensions, e.g. (1, 13, 20, 2048) and (1, 15, 20, 2048). Could anyone point out where I'm going wrong and what I need to change to get a (100, 2048) output per image? Also, what should the format of the annotation file be if I use the VQA dataset, since it doesn't have attributes like area, segmentation, categories, etc.?

Thanks

sIncerass commented 2 years ago

Hi there,

Thanks for the interest. I think you have to specify a transform that resizes every image to the same dimensions, e.g. 600 × 1000; then the resulting feature shapes will all be the same.
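If it helps, here is a minimal sketch of that idea in PyTorch. The (640, 1024) target size, the stride-32 assumption, and the `feats` placeholder are illustrative choices of mine, not values from this repo's configs:

```python
import torch
from torchvision import transforms

# Resize every image to one fixed size so the feature grid is always the
# same shape. 640 x 1024 is an illustrative choice (multiples of 32); a
# stride-32 ResNet backbone then yields a 20 x 32 grid of 2048-d features.
fixed_resize = transforms.Compose([
    transforms.Resize((640, 1024)),
    transforms.ToTensor(),
])

# Placeholder standing in for the extractor's output, shape (1, H/32, W/32, 2048).
feats = torch.randn(1, 20, 32, 2048)

# One 2048-d vector per grid cell: (1, 20, 32, 2048) -> (640, 2048).
flat = feats.flatten(1, 2).squeeze(0)

# If exactly 100 features are required, adaptive average pooling (my
# suggestion, not something stated in this thread) can force a 10 x 10 grid:
pooled = torch.nn.functional.adaptive_avg_pool2d(
    feats.permute(0, 3, 1, 2), (10, 10))                # (1, 2048, 10, 10)
hundred = pooled.flatten(2).transpose(1, 2).squeeze(0)  # (100, 2048)
print(flat.shape, hundred.shape)
```

Note that grid features are not 100 object regions; the feature count is just the number of grid cells, so it only becomes fixed once the input size is fixed.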

The VQA annotation file looks like this JSON file (link), which is constructed from Image, Question, and Answer entries.
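As a rough illustration of the shape of such a file (the field names below are hypothetical, not copied from the linked JSON; check the actual file for the exact schema), something like this should be enough, with no area/segmentation/category fields:

```python
import json

# Hypothetical VQA-style annotation entries: each record ties an image to a
# question and its answer. Field names and values are illustrative only.
annotations = [
    {
        "image_id": 458752,
        "image": "COCO_train2014_000000458752.jpg",
        "question": "What is this photo taken looking through?",
        "answer": "net",
    },
]

with open("vqa_annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)
```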

Thanks,