aurooj / WSG-VQA-VLTransformers

Weakly Supervised Grounding for VQA in Vision-Language Transformers
MIT License

the dimension problem #5

Closed zxzhou9 closed 1 year ago

zxzhou9 commented 1 year ago

The MSCOCO features used in the earlier two-stage pretraining have shape [36, 2048], but the GQA features downloaded from the link you pointed to have shape [7, 7, 2048], which flattens to [49, 2048]. As a result, fine-tuning the MSCOCO-pretrained model on GQA fails because the input feature dimensions do not match.
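A minimal sketch of the mismatch described above (the array shapes are taken from this report; the variable names are illustrative, not from the repository): MSCOCO bottom-up features are 36 region vectors, while the GQA grid features are a 7x7 spatial map that flattens to 49 tokens.

```python
import numpy as np

# MSCOCO bottom-up-attention features: 36 object regions x 2048-d ResNet features
mscoco_feats = np.zeros((36, 2048), dtype=np.float32)

# GQA grid features: 7x7 spatial grid x 2048-d features
gqa_feats = np.zeros((7, 7, 2048), dtype=np.float32)

# Flattening the spatial grid yields 49 tokens, not 36,
# so the sequence lengths seen by the transformer differ.
gqa_flat = gqa_feats.reshape(-1, gqa_feats.shape[-1])
print(mscoco_feats.shape)  # (36, 2048)
print(gqa_flat.shape)      # (49, 2048)
```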

aurooj commented 1 year ago

@zxzhou9 Can you please confirm the input feature size for the MSCOCO features? For MSCOCO, we extracted ResNet-101 features of the same size and used those to pretrain our model.

zxzhou9 commented 1 year ago

@aurooj I got past that by changing two files:

1. modeling_capsbert.py: changed hw from 7 to 6.
2. lxmert_pretrain.py -> class LXMERT -> forward: changed h from 7 to 6.

However, if I change 6 back to 7, I get the error "all input arrays must have the same shape". I wonder if this error comes from my .tsv (.hdf5) file, which I downloaded following Hao Tan's instructions. Could you please upload your train2014_obj36.tsv (or another format) for reference?
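For what it's worth, that exact message is what NumPy raises when per-image feature arrays of different shapes are stacked into one batch. A minimal sketch reproducing it (illustrative shapes only, not the repository's actual loader code):

```python
import numpy as np

# Two per-image feature arrays with mismatched token counts,
# e.g. a 7x7 grid flattened to 49 rows vs. a 6x6 grid flattened to 36 rows.
feats_49 = np.zeros((49, 2048), dtype=np.float32)
feats_36 = np.zeros((36, 2048), dtype=np.float32)

try:
    # Stacking mismatched arrays raises the error quoted above.
    batch = np.stack([feats_49, feats_36])
except ValueError as e:
    print(e)  # all input arrays must have the same shape
```

So one quick sanity check is to iterate over the .tsv/.hdf5 file and assert every image's feature array has the same shape before batching.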