Open JJ-res101 opened 2 years ago
The box is annotated to match the whole sentence, not a particular phrase.
I think a box is annotated for each phrase in the Flickr30K Entities data. As your paper says, "Flickr30K Entities [38] augments the original Flickr30K [58] with short region phrase correspondence annotations." Maybe the 'Flickr' dataset you use has one box annotation per sentence. Is that right? :)
As you quoted, "Flickr30K Entities [38] augments the original Flickr30K [58] with short region phrase correspondence annotations." This means the original Flickr30K sentences are split into short phrases, and each phrase is annotated with a bbox. When training on Flickr30K Entities, each sample consists of a phrase and a bbox.
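To make the phrase/bbox pairing concrete, here is a minimal sketch of how such samples could be built. It is not the repository's loader. It assumes the sentence markup looks like `[/EN#<entity_id>/<type> phrase words]`, and the `boxes_by_entity` dict and the example sentence and coordinates are hypothetical, so please check them against your own copy of the annotations.

```python
import re

# Assumed markup for an annotated phrase: "[/EN#<entity_id>/<type> phrase words]".
# This pattern is an assumption for illustration; verify it against the real files.
PHRASE_RE = re.compile(r"\[/EN#(\d+)/\S+ ([^\]]+)\]")


def phrase_box_samples(annotated_sentence, boxes_by_entity):
    """Return one (phrase, bbox) sample per box linked to each annotated phrase.

    boxes_by_entity is a hypothetical dict: entity id (str) -> list of
    (xmin, ymin, xmax, ymax) tuples.
    """
    samples = []
    for entity_id, phrase in PHRASE_RE.findall(annotated_sentence):
        # Phrases with no linked box are skipped.
        for box in boxes_by_entity.get(entity_id, []):
            samples.append((phrase, box))
    return samples


# Made-up example data, not taken from the dataset.
sentence = "[/EN#1/people A little boy] in [/EN#2/clothing a blue shirt] plays outside ."
boxes = {"1": [(10, 20, 110, 220)], "2": [(30, 80, 90, 150)]}
samples = phrase_box_samples(sentence, boxes)
for phrase, box in samples:
    print(phrase, box)
```

With this layout, one sentence yields several training samples, one per phrase and box, rather than a single box for the whole sentence.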
Thank you for your excellent work! How does the model get the box of a certain phrase in a sentence? Right now it seems to me that the model can't do that. Is that right?