Deephome opened this issue 1 year ago
Same here, I am trying to figure out what they are.
Are they a fixed grid or learnable parameters?
I think it's the same grid-like structure used in Deformable DETR: a uniform grid over image coordinates, where each grid centre serves as an anchor, relative to which the model regresses the deviation of the correct bbox.
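To make the anchor-grid idea concrete, here is a minimal sketch in NumPy. This is my guess based on Deformable DETR's reference points, not VisionLLM's actual code; the function names and the sigmoid-on-size decoding are my own assumptions.

```python
import numpy as np

def make_anchor_grid(num_side=10):
    """(num_side**2, 2) array of normalized (cx, cy) anchor centres on a uniform grid."""
    steps = (np.arange(num_side) + 0.5) / num_side      # centre of each grid cell in [0, 1]
    cy, cx = np.meshgrid(steps, steps, indexing="ij")
    return np.stack([cx.ravel(), cy.ravel()], axis=-1)

def decode_boxes(anchors, deltas):
    """Apply predicted (dx, dy, w_logit, h_logit) to anchors -> (cx, cy, w, h)."""
    centres = anchors + deltas[:, :2]                   # regress the deviation of the centre
    sizes = 1.0 / (1.0 + np.exp(-deltas[:, 2:]))        # sigmoid keeps w, h in (0, 1)
    return np.concatenate([centres, sizes], axis=-1)

anchors = make_anchor_grid(10)                          # 100 anchors ~ 100 object queries
boxes = decode_boxes(anchors, np.zeros((100, 4)))       # zero deltas -> boxes sit on anchors
```

With zero deltas every box just sits on its anchor centre, which is the point: the queries start from fixed grid positions and the decoder only has to predict small offsets.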
Hi, I have a similar question. If VisionLLM uses a Deformable-DETR-like decoder and the object queries act as positional anchors, then Hungarian matching is required to assign GT boxes to object queries. However, the authors don't mention this in the paper. What do you think the training details of these object queries might be?
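For reference, the standard DETR-style Hungarian matching looks like the sketch below. Whether VisionLLM trains its queries this way is exactly the open question here; the L1 box cost (the real DETR cost also includes class and GIoU terms) and the toy boxes are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """Assign each GT box to one prediction by minimizing total L1 box distance."""
    # cost[i, j] = L1 distance between prediction i and GT j
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)      # optimal one-to-one assignment
    return pred_idx, gt_idx

# Toy example: 3 predictions (e.g. from 3 object queries), 2 GT boxes (cx, cy, w, h)
preds = np.array([[0.10, 0.10, 0.20, 0.20],
                  [0.80, 0.80, 0.10, 0.10],
                  [0.50, 0.50, 0.30, 0.30]])
gts = np.array([[0.82, 0.79, 0.10, 0.10],
                [0.12, 0.10, 0.20, 0.20]])
pred_idx, gt_idx = hungarian_match(preds, gts)          # unmatched queries get "no object"
```

Predictions left unmatched (here the third query) would be supervised toward the "no object" class, as in DETR.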
Same here.
https://github.com/OpenGVLab/VisionLLM/issues/7#issue-2120370842
Maybe the link below will be helpful:
https://openreview.net/forum?id=Vx1JadlOIt&noteId=616Bhd6O5S
Maybe the image below will be helpful too.
Hi, your work is great! But I am confused about the location tokens used in the decoder; could you provide more details about them?