facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

M4C TextVQA bbox information in annotation_db and feature_db inconsistent? #1213

Closed StanLei52 closed 2 years ago

StanLei52 commented 2 years ago

❓ Questions and Help

Thank you for the wonderful MMF! I have a question about the TextVQA annotations and extracted features used in M4C. I noticed that the M4C dataset uses features from textvqa/defaults/features/open_images/detectron.lmdb and bbox info (normalized boxes) from textvqa/defaults/annotations/imdb_train_ocr_en.npy. However, the bbox info in detectron.lmdb seems to be different from that in imdb_train_ocr_en.npy. To reproduce:

import os
import pickle

import lmdb
import numpy as np

textvqa_mmf = np.load(
    '/path_to_dl/.cache/torch/mmf/data/datasets/textvqa/defaults/annotations/imdb_train_ocr_en.npy',
    allow_pickle=True,
)

# use index 2 as an example; other indices show the same problem
k = textvqa_mmf[2]['image_id']
k = f'train/{k}'.encode()

# textvqa_feat_path points to mmf/data/datasets/textvqa/defaults/features/open_images
lmdb_path = os.path.join(textvqa_feat_path, 'detectron.lmdb')
env = lmdb.open(
    lmdb_path,
    subdir=os.path.isdir(lmdb_path),
    readonly=True,
    lock=False,
    readahead=False,
    meminit=False,
)
with env.begin(write=False, buffers=True) as txn:
    info = pickle.loads(txn.get(k))

obj_normalized_boxes = textvqa_mmf[2]['obj_normalized_boxes']
bbox = info['bbox']
h, w = info['image_height'], info['image_width']

print(w, h)
print(obj_normalized_boxes[:5], '\n------------------')
print(obj_normalized_boxes[:5] * [w, h, w, h], '\n-------------------')
print(bbox[:5])

and the corresponding output was:

1024 667
[[0.6726976  0.3750508  0.7172746  0.43353033]
 [0.         0.32832143 0.6229808  0.759303  ]
 [0.03001874 0.37756875 0.2156691  0.62390286]
 [0.725957   0.36927903 0.75591475 0.41596937]
 [0.8151331  0.37255615 0.8881252  0.57103795]] 
------------------
[[688.84234619 250.15889224 734.48919678 289.16473055]
 [  0.         218.99039188 637.93231201 506.45508349]
 [ 30.73918724 251.8383573  220.84515381 416.14320582]
 [743.37994385 246.309111   774.05670166 277.45157099]
 [834.69628906 248.49495202 909.44018555 380.8823114 ]] 
-------------------
[[  43.625504  241.9476    254.8368    422.45166 ]
 [  97.672585  206.23537   673.9255    487.85403 ]
 [ 229.36331   173.47018   982.47      514.4217  ]
 [ 329.56985    21.870787 1039.6863    370.96887 ]
 [ 652.13275   177.89989   687.8305    239.47253 ]]

I think obj_normalized_boxes should be derived from bbox in the feature file, but from the above output they do not even appear to be in the same order. Is something wrong here? If we use the features in detectron.lmdb together with the bbox info in imdb_train_ocr_en.npy, the bbox info should be consistent between the two files.
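For what it's worth, the mismatch is easy to quantify: even the first boxes from the two files do not overlap at all. A minimal sketch, with the box values copied from the printed output above (1024 x 667 image):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] pixel coordinates."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# First denormalized annotation box vs. first lmdb box, from the output above.
ann_box = np.array([688.84234619, 250.15889224, 734.48919678, 289.16473055])
lmdb_box = np.array([43.625504, 241.9476, 254.8368, 422.45166])

print(iou(ann_box, lmdb_box))  # 0.0 -- the two boxes are disjoint
```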

Looking forward to your reply.

StanLei52 commented 2 years ago

@ronghanghu

ronghanghu commented 2 years ago

Hi, I think this is because we later switched the lmdb from Detectron (Caffe2) features to maskrcnn-benchmark features. We found that this change slightly boosts the TextVQA and TextCaps scores, but it may have introduced the bounding box discrepancy you mentioned.

If you would like to use the exact features from Caffe2 (which is used in LoRRA and M4C papers), they can be downloaded by adding textvqa.caffe2 to zoo_requirements and using textvqa/caffe2/features/open_images/detectron.lmdb as the feature path, like in https://github.com/facebookresearch/mmf/blob/582c7195cbf1eb948436b66c1e9e4bb2e5652a27/projects/m4c_captioner/configs/m4c_captioner/textcaps/with_caffe2_feat.yaml#L6-L16

One can edit the lines in the M4C config https://github.com/facebookresearch/mmf/blob/582c7195cbf1eb948436b66c1e9e4bb2e5652a27/projects/m4c/configs/textvqa/defaults.yaml#L8-L17 to switch to the Caffe2 feature lmdbs.
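Roughly, the edited dataset config might look like the sketch below (a guess modeled on the linked with_caffe2_feat.yaml; the exact zoo_requirements entries and split keys in your config may differ, so please double-check against the linked files):

```yaml
dataset_config:
  textvqa:
    zoo_requirements:
    - textvqa.defaults
    - textvqa.caffe2
    - textvqa.ocr_en
    features:
      train:
      - textvqa/caffe2/features/open_images/detectron.lmdb
      val:
      - textvqa/caffe2/features/open_images/detectron.lmdb
      test:
      - textvqa/caffe2/features/open_images/detectron.lmdb
```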

StanLei52 commented 2 years ago

Thank you for your reply @ronghanghu.

So obj_normalized_boxes in imdb_train_ocr_en.npy is from Detectron (Caffe2) and bbox in detectron.lmdb is from maskrcnn-benchmark, is that correct? Since the features and bbox within the same detectron.lmdb are consistent, can we compute obj_normalized_boxes from bbox and the image width and height by:

# inside the dataset's sample loading, where sample.image_info_0
# comes from the feature db
orig_boxes = sample.image_info_0.bbox
w, h = sample.image_info_0.image_width, sample.image_info_0.image_height
normalized_boxes = orig_boxes / np.array([w, h, w, h])
sample.obj_bbox_coordinates = self.copy_processor(
    {"blob": normalized_boxes}
)["blob"]

instead of using the normalized bbox from the annotation:

# 2. Load object
# object bounding box information
## fetched by mmf sample info
# if "obj_normalized_boxes" in sample_info and hasattr(self, "copy_processor"):    # use copy_processor to convert to torch tensor
#     sample.obj_bbox_coordinates = self.copy_processor(
#         {"blob": sample_info["obj_normalized_boxes"]}
#     )["blob"]

Also, you mentioned the slight boost from the new feature extractor. I do not understand why it can boost the score, since the features and obj_normalized_boxes do not match (I assume the features and bbox within the same feature file always match, if I understand correctly).

ronghanghu commented 2 years ago

can we calculate the obj_normalized_boxes using bbox and its image width and height by

Yes, you can do this and directly compute the bounding boxes from the lmdb features.

Also you mentioned the slight boosts by using the new feature extractor. I do not understand why it can boost the score since the feature and the obj_normalized_boxes do not match (i assume the feature and bbox in the same feature file always match if i understand correctly).

There was only a minor boost in the scores. Probably the features extracted from maskrcnn-benchmark were slightly better and gave a small improvement despite the discrepancy in the bounding boxes. You can use the Caffe2 lmdbs to get the exact setting from the M4C paper.
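For anyone following along, the normalization above is just an elementwise divide, and it round-trips cleanly. A self-contained sketch with illustrative box values (not taken from the real lmdb):

```python
import numpy as np

# Hypothetical raw boxes in [x1, y1, x2, y2] pixel coordinates, plus the
# image size they were extracted at (values are illustrative only).
bbox = np.array([
    [43.6, 241.9, 254.8, 422.5],
    [97.7, 206.2, 673.9, 487.9],
], dtype=np.float32)
w, h = 1024, 667

# Normalize to [0, 1], matching how obj_normalized_boxes is stored.
scale = np.array([w, h, w, h], dtype=np.float32)
normalized_boxes = bbox / scale

assert normalized_boxes.min() >= 0.0 and normalized_boxes.max() <= 1.0
# Scaling back up recovers the original pixel boxes.
assert np.allclose(normalized_boxes * scale, bbox)
```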

StanLei52 commented 2 years ago

Good to know, thank you Ronghang!