BAAI-WuDao / BriVL

Bridging Vision and Language Model
MIT License

new image bbox #1

Open · 21157651 opened this issue 3 years ago

21157651 commented 3 years ago

How can I get the 'bbox' field in BriVL/BriVL-code-inference/data/jsonls/example.jsonl?
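For context, the jsonl being asked about can be inspected directly. This is a minimal sketch; the exact field layout is an assumption based on this thread (a 'bbox' field holding roughly 100 ROIs per image):

```python
# Peek at the example.jsonl format; field names are assumptions from this thread.
import json

with open("BriVL/BriVL-code-inference/data/jsonls/example.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record.keys())        # inspect the available fields
        print(len(record["bbox"]))  # expected: ~100 boxes per image
        break
```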

knaffe commented 3 years ago

The bboxes in these examples contain 100 ROIs each. How do you use Faster R-CNN to detect that many objects?

MischaQI commented 3 years ago

I used Detectron2 with the mask_rcnn_R_50_FPN_3x.yaml weights to get 100 candidate bboxes, but the coordinates are not exactly the same as those in example.jsonl. Could the object detector used in this project be provided for complete reproduction?
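For reference, the Detectron2 route described above looks roughly like the sketch below. The config name and raising the per-image detection cap to 100 are assumptions about this commenter's setup, not the project's official extractor:

```python
# A rough Detectron2 sketch for extracting ~100 candidate boxes per image.
# NOT the project's official detector; config choices are assumptions.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.0  # keep low-confidence boxes too
cfg.TEST.DETECTIONS_PER_IMAGE = 100          # cap detections at 100 per image

predictor = DefaultPredictor(cfg)
image = cv2.imread("example.jpg")            # hypothetical input image
boxes = predictor(image)["instances"].pred_boxes.tensor.cpu().numpy()
print(boxes.shape)                           # (<=100, 4) in (x1, y1, x2, y2)
```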

chuhaojin commented 3 years ago

BriVL uses the Bottom-Up Attention model as its object detection tool; the model can be obtained from BriVL-BUA-applications.

knaffe commented 3 years ago

By the way, I have tested the AIC-ICC validation set against BriVL-API 1.0, but the retrieval result is very low (Recall@1 < 1%). I used your released retrieval code together with Faiss vector search, but the results are still disappointing. Could you release more details about this experiment from the paper?
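For readers trying to reproduce this kind of number, here is a hedged Recall@1 sketch with Faiss. The feature files, their shapes, and the one-caption-per-image pairing are illustrative assumptions, not the repo's released evaluation pipeline:

```python
# Minimal text-to-image Recall@1 with Faiss; inputs are hypothetical.
import faiss
import numpy as np

# Assumption: row i of txt_feats is the caption paired with image i.
img_feats = np.load("img_feats.npy").astype("float32")  # (N, d)
txt_feats = np.load("txt_feats.npy").astype("float32")  # (N, d)

# Normalize so inner product equals cosine similarity.
faiss.normalize_L2(img_feats)
faiss.normalize_L2(txt_feats)

index = faiss.IndexFlatIP(img_feats.shape[1])
index.add(img_feats)

# Retrieve the top-1 image for every caption.
_, nn = index.search(txt_feats, 1)
t2i_r1 = float((nn[:, 0] == np.arange(len(txt_feats))).mean())
print(f"t2i Recall@1: {t2i_r1:.2%}")
```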

zgj-gutou commented 3 years ago

> BriVL uses the Bottom-Up Attention model as its object detection tool; the model can be obtained from BriVL-BUA-applications.

Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl, so I am a little confused about what went wrong when using BriVL-BUA-applications. Could you take an example from example.jsonl and generate its bboxes with BriVL-BUA-applications? Thank you very much!

chuhaojin commented 3 years ago

> BriVL uses the Bottom-Up Attention model as its object detection tool; the model can be obtained from BriVL-BUA-applications.

> Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl, so I am a little confused about what went wrong when using BriVL-BUA-applications. Could you take an example from example.jsonl and generate its bboxes with BriVL-BUA-applications? Thank you very much!

Due to differences between library versions or machines, the bounding-box results will vary slightly; this does not affect BriVL's performance. In addition, you can calculate the IoU between the two sets of bounding boxes to verify their correctness.
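Following the IoU suggestion above, a minimal NumPy sketch; the .npy input files and the best-match-per-box comparison are illustrative assumptions:

```python
# Compare two sets of (x1, y1, x2, y2) boxes by best-match IoU.
import numpy as np

def pairwise_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """IoU between every box in a (N, 4) and every box in b (M, 4), xyxy format."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

mine = np.load("my_boxes.npy")   # hypothetical: boxes you extracted
ref = np.load("ref_boxes.npy")   # hypothetical: boxes from example.jsonl
best = pairwise_iou(mine, ref).max(axis=1)  # best reference match per box
print(f"mean best-match IoU: {best.mean():.3f}")
```

A high mean best-match IoU (close to 1.0) would indicate the two extractions agree up to the version-dependent jitter described above.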

chuhaojin commented 3 years ago

We just fixed a bug: change the image size in cfg/test.yml to 380. Please pay attention to this when using BriVL; sorry for the inconvenience.

troilus-canva commented 3 years ago

> BriVL uses the Bottom-Up Attention model as its object detection tool; the model can be obtained from BriVL-BUA-applications.

> Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl. [...]

I can reproduce bboxes identical to those in example.jsonl.

zgj-gutou commented 3 years ago

> BriVL uses the Bottom-Up Attention model as its object detection tool; the model can be obtained from BriVL-BUA-applications.

> Hi, I used BriVL-BUA-applications to get the bboxes. I modified the extract-bua-caffe-r101.yaml file, changing MAX_BOXES from 45 to 100, but the coordinates are not exactly the same as those in example.jsonl. [...]

> I can reproduce bboxes identical to those in example.jsonl.

Hello, how did you do that? Can you tell me what you changed in the extract-bua-caffe-r101.yaml file? Thank you!

troilus-canva commented 3 years ago

> BriVL uses the Bottom-Up Attention model as its object detection tool; the model can be obtained from BriVL-BUA-applications.

> Hi, I used BriVL-BUA-applications to get the bboxes. [...]

> I can reproduce bboxes identical to those in example.jsonl.

> Hello, how did you do that? Can you tell me what you changed in the extract-bua-caffe-r101.yaml file? Thank you!

I didn't change anything except the device (from cuda to cpu, since I'm running it on a Mac), and I ran the command mentioned in the README: `python3 bbox_extractor.py --img_path ../BriVL/data/imgs/baike_14014334_0.jpg --out_path test_data/test1.npz`
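To sanity-check the extractor's output, one can inspect the saved .npz directly. The key names stored inside the file are not documented in this thread, so listing them first is the safe move:

```python
# Inspect the arrays written by bbox_extractor.py; key names are unknown
# here, so enumerate npz.files rather than assuming a specific key.
import numpy as np

npz = np.load("test_data/test1.npz")
print(npz.files)                      # names of the stored arrays
for name in npz.files:
    print(name, npz[name].shape)      # e.g. a (100, 4) box array
```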

Qiulin-W commented 2 years ago

> By the way, I have tested the AIC-ICC validation set against BriVL-API 1.0, but the retrieval result is very low (Recall@1 < 1%). [...]

Hi, I got similar results to yours on the AIC-ICC validation set (30,000 images, 5 captions each): i2t R@1: 1.57%, t2i R@1: 0.48%. After going into the details, I found the model did provide some reasonable results, e.g.:

(screenshot: Screenshot_from_2021-10-19_17-42-47)

The highlighted text in the bottom-left is the query text, and the ground-truth image is above it. The three images on the right are the top-3 images matched by the model. However, as the example shows, the model only matches the words "裙子" (skirt) and "女孩" (girl) and ignores the other information, which severely hurts the recall.

Moreover, I found another paper (https://arxiv.org/abs/2109.04699v2) that ran the same evaluation on the AIC-ICC dataset. They mention conducting experiments on a "test subset" of AIC-ICC containing only 10,000 samples, and the results they report for the WenLan model are similar to those in the WenLan paper. The full validation set, however, contains 30,000 images and 150,000 captions.

(screenshot: E-CLIP dataset details)

Could the authors (@chuhaojin) provide more details about the test set and any pre-processing steps? Many thanks!

chuhaojin commented 2 years ago

> @Qiulin-W @knaffe @chuhaojin The following results were tested on the AIC-ICC validation dataset using the code in this repo. I can confirm that the jsonl processing results are exactly the same as the file provided in the example.
>
> (screenshot: results table)
>
> This result is far inferior to the result in the paper. Any suggestions?

@huang-xx @knaffe Sorry, I don't know more about the evaluation details of the BriVL model. You can consult the student in the Model Development Group (@moonlitt, who is in charge of this part) for more details.

jim4399266 commented 2 years ago

@moonlitt Hello, my evaluation results (i2t R@1: 1.09%; t2i R@1: 0.37%) on the AIC-ICC validation dataset (I used 30,000 samples) are also far from the results in the paper. Could you please share the evaluation code as a reference?