aimagelab / show-control-and-tell

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. CVPR 2019
https://arxiv.org/abs/1811.10652
BSD 3-Clause "New" or "Revised" License

How to filter the detected objects? #14

Open Wangzhen-kris opened 5 years ago

Wangzhen-kris commented 5 years ago

Hi, thanks for sharing. I have a question and hope you can answer it. Faster R-CNN usually yields 36 detected objects per image, but I see that the `det_sequences` in `coco_entities_release.json` contain only a few objects. I'm not sure which mechanism in the model implements this filtering from 36 objects down to a few. Is my understanding correct that the sorting network plays this role? Since the sorting network ranks the more important regions first, and only the first few regions of the region-set sequence are used to generate the caption, the many remaining regions are effectively filtered out. Or is it the adaptive attention with visual sentinel? Thank you.
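To make the hypothesis above concrete, here is a minimal sketch of the ranking-then-truncation idea the question describes: keep only the top-k regions by some per-region importance score. This is NOT the repository's actual code; the function `filter_regions` and the `importance_scores` input are hypothetical stand-ins for whatever scores a sorting/ranking network would produce.

```python
import numpy as np

def filter_regions(region_features, importance_scores, k=5):
    """Keep only the k regions ranked most important (hypothetical illustration).

    region_features: (N, D) array of detector features, e.g. N=36 from Faster R-CNN.
    importance_scores: (N,) array of per-region importance scores, assumed to
        come from some ranking/sorting network (not the repo's real API).
    Returns the top-k features ordered most-to-least important, plus their indices.
    """
    order = np.argsort(-importance_scores)  # indices sorted by descending score
    top = order[:k]                         # truncate: everything after rank k is dropped
    return region_features[top], top

# Example: 36 detections with random features and scores.
rng = np.random.default_rng(0)
feats = rng.normal(size=(36, 2048))
scores = rng.random(36)
kept, idx = filter_regions(feats, scores, k=5)
print(kept.shape)  # (5, 2048) -- only 5 of the 36 regions survive
```

Under this reading, the "filtering" is simply that regions ranked beyond the first few never enter caption generation, rather than being explicitly discarded by the attention module.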