Hi, thanks for sharing your work.
I have a question and hope you can help me understand.
We usually get 36 detected objects from Faster R-CNN, but I see that the `det_sequences` entries in `coco_entities_release.json` usually contain only a few objects. I'm not sure which mechanism in the model implements this filtering from 36 objects down to just a few.
Is my understanding correct that the sorting network plays this role? That is, the sorting network ranks the more important regions at the front, and only the first few regions of the region-set sequence are used to generate the caption, so the many unused regions are effectively filtered out.
Or is it instead the adaptive attention with the visual sentinel that does the filtering?
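To make my hypothesis concrete, here is a minimal sketch of what I imagine the sorting-based filtering does. All values here (the importance scores, the cutoff `k`) are made up for illustration and are not taken from the paper or the released code:

```python
# Hypothetical sketch of my understanding of the filtering step:
# rank all 36 regions by some importance score, then keep only the
# top few for caption generation. Scores and k are invented here.
import random

random.seed(0)
num_regions = 36  # Faster R-CNN typically returns 36 region features
importance = [random.random() for _ in range(num_regions)]

# Rank region indices by importance, highest first...
ranked = sorted(range(num_regions), key=lambda i: importance[i], reverse=True)

# ...and keep only the first few; the remaining regions would then
# never appear in det_sequences, i.e. they are filtered out.
k = 4  # made-up cutoff
selected = ranked[:k]
print(len(selected), "of", num_regions, "regions kept")
```

Is this roughly what happens inside the model, or does the selection come from the visual sentinel instead?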
Thank you.