Hi, thank you for your amazing work!
I want to know how you trained your Region Proposal Network.
In Section 1, you say, "We introduce an open-vocabulary object detector method to learn object-language alignments directly from image-text pair data." This sounds like you didn't use any bounding box annotations.
However, in Section 3.1, you say, "our goal is to build an object detector, trained on a dataset with base-class bounding box annotations and a dataset of image-caption pairs 〈 I, C 〉 associated with a large vocabulary C_open". This sounds like some bounding boxes are used for supervision.
This confuses me. My guess is that you use the ground-truth bounding boxes of the base classes to train the RPN.
Looking forward to your reply. Thank you very much!