Encoding referent tokens

OpenRobotLab / Grounded_3D-LLM

Code&Data for Grounded 3D-LLM with Referent Tokens

https://groundedscenellm.github.io/grounded_3d-llm.github.io/

Encoding referent tokens #6

Open Germany321 opened 2 months ago

Germany321 commented 2 months ago

I notice that referent tokens are interleaved in the output. Can multiple referent tokens appear in a single text prompt, such as "Describe the table and the chair."?

chenyilun95 commented 2 months ago

Yes, referent tokens can appear multiple times in one prompt. The current language data focuses mainly on single objects, which may limit performance with multiple referent tokens. Please refer to the supplementary file for the instruction templates the model was trained on.

Germany321 commented 2 months ago

Thanks for your quick reply. Another question: if there are multiple referent tokens in the prompt, how do you distinguish the different referent scene queries? In the above example, "Describe the table < /ref> and the chair < /ref>.", it seems that decoding the "< /ref>" token alone cannot distinguish the two instances. How do you retrieve the separate object queries for the table and the chair from this referent token?

chenyilun95 commented 2 months ago

Prior scene queries can be decomposed into scene masks, which lets us obtain the mapping between instances and queries. During training, a query whose mask IoU with an instance is greater than 0.3 is considered a positive match (see Sec. B of the supplementary material).
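
To illustrate the matching rule described above, here is a minimal sketch, not the repository's code. It assumes each scene query and each ground-truth instance is given as a binary per-point mask in a NumPy array. The function names `mask_iou` and `positive_queries` and the toy masks are hypothetical. Only the 0.3 threshold comes from the reply.

```python
import numpy as np

def mask_iou(query_masks, instance_masks):
    """Pairwise IoU between binary masks of shape (Q, N) and (K, N) -> (Q, K)."""
    q = query_masks.astype(bool)
    g = instance_masks.astype(bool)
    inter = (q[:, None, :] & g[None, :, :]).sum(-1)
    union = (q[:, None, :] | g[None, :, :]).sum(-1)
    # Guard against empty unions to avoid division by zero.
    return inter / np.maximum(union, 1)

def positive_queries(query_masks, instance_masks, thresh=0.3):
    """For each instance, list the query indices whose mask IoU exceeds thresh."""
    iou = mask_iou(query_masks, instance_masks)
    return [np.flatnonzero(iou[:, k] > thresh).tolist() for k in range(iou.shape[1])]

# Toy scene with 8 points, 3 scene queries, and 2 instances (hypothetical data).
query_masks = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0, 0],
])
instance_masks = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],  # "table"
    [0, 0, 0, 0, 0, 1, 1, 1],  # "chair"
])

matches = positive_queries(query_masks, instance_masks)
print(matches)  # query indices matched to the table and to the chair
```

In this toy case the table and the chair are matched to disjoint sets of queries, so each referent token can be supervised against its own instance's queries even though the token text is identical.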

Germany321 commented 2 months ago

Thanks for the reply, I finally understand the mechanism.