Open Germany321 opened 2 months ago
Yes, a referent token can occur multiple times in a single prompt. The current language data focuses mainly on single objects, which may limit performance with multiple referent tokens. Please refer to the supplementary file for the instruction templates the model was trained on.
Thanks for your quick reply. Another question: if there are multiple referent tokens in the prompt, how do you distinguish the different referent scene queries? In the example above, "Describe the table < /ref> and the chair < /ref>.", it seems that decoding the "< /ref>" token alone cannot distinguish the two instances. How do you retrieve the separate object queries for the table and the chair based on this referent token?
Prior scene queries can be decomposed into scene masks, which gives us the mapping between instances and queries. During training, a mask IoU greater than 0.3 is considered a positive match (see supplementary Sec. B).
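For readers following the thread, here is a minimal sketch of how such an instance-to-query mapping could be computed. It assumes each scene query decomposes into a binary per-point mask and each referenced instance has a mask over the same points. The function names and the toy masks are hypothetical and not taken from the repository; only the 0.3 threshold comes from the reply above.

```python
from typing import Dict, List, Sequence

IOU_THRESHOLD = 0.3  # positive-match threshold quoted in the reply above


def mask_iou(a: Sequence[int], b: Sequence[int]) -> float:
    """IoU of two binary masks defined over the same scene points."""
    if len(a) != len(b):
        raise ValueError("masks must cover the same points")
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def match_instances_to_queries(
    instance_masks: Dict[str, Sequence[int]],
    query_masks: List[Sequence[int]],
    threshold: float = IOU_THRESHOLD,
) -> Dict[str, List[int]]:
    """For each instance, list the query indices whose mask IoU exceeds the threshold."""
    return {
        name: [
            q for q, q_mask in enumerate(query_masks)
            if mask_iou(inst_mask, q_mask) > threshold
        ]
        for name, inst_mask in instance_masks.items()
    }


# Toy scene with 8 points (hypothetical data, for illustration only).
instance_masks = {
    "table": [1, 1, 1, 1, 0, 0, 0, 0],
    "chair": [0, 0, 0, 0, 0, 1, 1, 1],
}
query_masks = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # query 0: overlaps the table
    [0, 0, 0, 0, 1, 1, 1, 1],  # query 1: overlaps the chair
    [0, 0, 0, 1, 1, 0, 0, 0],  # query 2: weak overlap with the table only
]

matches = match_instances_to_queries(instance_masks, query_masks)
print(matches)
```

With a mapping like this, the table and the chair each resolve to their own query even though both are marked by the same referent token in the prompt. Query 2 is rejected because its IoU with the table falls below the threshold.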
Thanks for the reply, I finally understand the mechanism.
I notice that referent tokens are interleaved in the output. Can multiple referent tokens appear in a single text prompt, such as "Describe the table < /ref> and the chair < /ref>."?