facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Questions about GQA feature and visual_genome feature #1211

Open StanLei52 opened 2 years ago

StanLei52 commented 2 years ago

❓ Questions and Help

Hi, I have a question about the features of some of the datasets. I downloaded the GQA and Visual Genome features from MMF. Since GQA and VG are built from the same source images, the extracted features should be identical if the same object detector was used for feature extraction. However, when I checked the downloaded features, I noticed some differences:

# compare gqa and visual_genome features for the same image
import os
import pickle

import lmdb
import numpy as np

image_id = 1063

# loading gqa:
env = lmdb.open(
        gqa_feat_path + '/' + 'gqa.lmdb',
        subdir=os.path.isdir(gqa_feat_path + '/' + 'gqa.lmdb'),
        readonly=True,
        lock=False,
        readahead=False,
        meminit=False,
)
with env.begin(write=False, buffers=True) as txn:
    gqa_example = pickle.loads(txn.get(str(image_id).encode()))

# loading visual genome
vg_fc6_feat = np.load(vg_feat_path + '/' + str(image_id) + '.npy', allow_pickle=True)  # (100, 2048)
vg_fc6_info = np.load(vg_feat_path + '/' + str(image_id) + '_info.npy', allow_pickle=True).item()

######### 
>> gqa_example['objects']
>> array([  43,   43,  289,  506,  506,  314,   33,  314,  307,  248,   43,
        506,  314,  307,  248,  506,  314,  289,   43,  506, 1260,  506,
        506,  314,   33,  371,   33,   33,  314, 1000,  506,  307,  506,
        248,  262,  200,  371,  221,  248, 1260,  314,  506,  248,  307,
         33,  506,  314,  314,  221, 1032,  506,  371,  506, 1260,  506,
        506,  371,  248, 1000, 1260,  506,  289,  314,  719,  262,  248,
        221,  693,  506, 1260,  248,  506,  221,  800,  200,  697,  506,
        200,   33,  506,  371,   33,  506,  248,  697,  248, 1000,  248,
        466,  221,  697,  314,  758,  371,   33,  697,  262,  506, 1260,
        289])

>> vg_fc6_info['objects']
>> array([  44,  507,  290,  315,  308,  507,  249, 1001,  315,  507,  801,
        507,  315,  315,   44,  372,  249,  507,  507,   34,  372, 1001,
        222,  315,    0,  315,  263,  308,  372,   34,  201,    0, 1261,
        249,  290,   44,  759,    0,  308, 1033,    0,    0,    0,  759,
          0,    0,    0,    0,  315,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,  222,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0])
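For what it's worth, comparing the unique label values in the two printouts above shows a pattern: every nonzero VG label is exactly one greater than some GQA label, and the VG array is additionally zero-padded at the end. That looks more like an off-by-one in the class indexing (e.g. a background/padding class at 0) plus a different number of kept boxes than a totally unrelated detector vocabulary, though that is just my guess. The sets below are transcribed from the two outputs above:

```python
# unique labels from the two printouts for image_id 1063
gqa_labels = {43, 289, 506, 314, 33, 307, 248, 1260, 371, 1000, 262,
              200, 221, 1032, 719, 693, 800, 697, 466, 758}
vg_labels = {44, 507, 290, 315, 308, 249, 1001, 801, 372, 34, 222,
             263, 201, 1261, 759, 1033, 0}

# every nonzero VG label is a GQA label shifted by one
print((vg_labels - {0}) <= {g + 1 for g in gqa_labels})  # True
```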

I also checked the 2048-d features, and they differ as well.
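To quantify how different the features are (rather than eyeballing raw arrays), I compared corresponding rows with cosine similarity. The `rowwise_cosine` helper is my own, and this only makes sense if the boxes in the two dumps are in the same order, which they may not be:

```python
import numpy as np

def rowwise_cosine(a, b):
    # cosine similarity between corresponding rows of two (N, D) arrays
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a_norm * b_norm).sum(axis=1)

# sanity check: identical features give similarity 1.0 for every row
rng = np.random.default_rng(0)
feat = rng.standard_normal((100, 2048))
sims = rowwise_cosine(feat, feat)
print(sims.min(), sims.max())  # both ~1.0

# then e.g. rowwise_cosine(gqa_features, vg_fc6_feat) on the real arrays
```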

I assume these two datasets (GQA and VG) used different object detectors for feature extraction? If so, could you please clarify which detector was used for each? I also need this information for TextVQA.

Looking forward to your reply :)