Hi, thank you for your amazing work!
I want to know how you trained your Region Proposal Network.
In Section 1, you say, "We introduce an open-vocabulary object detector method to learn object-language alignments directly from image-text pair data." This sounds like you didn't use any bounding box annotations.
However, in Section 3.1, you say, "our goal is to build an object detector, trained on a dataset with base-class bounding box annotations and a dataset of image-caption pairs 〈 I, C 〉 associated with a large vocabulary C_open". This sounds like some bounding boxes are used for supervision.
This confuses me. My guess is that you use the ground-truth bounding boxes of the base classes to train the RPN.
Looking forward to your reply. Thank you very much!