Deephome opened this issue 1 year ago
Same here, I am trying to figure out what they are.
Are they a fixed grid or learnable parameters?
I think it's the same grid-like structure used in Deformable DETR: a uniform grid over image coordinates, where each grid centre serves as an anchor, relative to which the model regresses the deviation of the correct bbox.
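To make the anchor-grid idea concrete, here is a minimal sketch in NumPy. This is my guess based on Deformable DETR's reference points, not VisionLLM's actual code; the function names and the sigmoid-on-size decoding are my own assumptions.

```python
import numpy as np

def make_anchor_grid(num_side=10):
    """(num_side**2, 2) array of normalized (cx, cy) anchor centres on a uniform grid."""
    steps = (np.arange(num_side) + 0.5) / num_side      # centre of each grid cell in [0, 1]
    cy, cx = np.meshgrid(steps, steps, indexing="ij")
    return np.stack([cx.ravel(), cy.ravel()], axis=-1)

def decode_boxes(anchors, deltas):
    """Apply predicted (dx, dy, w_logit, h_logit) to anchors -> (cx, cy, w, h)."""
    centres = anchors + deltas[:, :2]                   # regress the deviation of the centre
    sizes = 1.0 / (1.0 + np.exp(-deltas[:, 2:]))        # sigmoid keeps w, h in (0, 1)
    return np.concatenate([centres, sizes], axis=-1)

anchors = make_anchor_grid(10)                          # 100 anchors ~ 100 object queries
boxes = decode_boxes(anchors, np.zeros((100, 4)))       # zero deltas -> boxes sit on anchors
```

With zero deltas every box just sits on its anchor centre, which is the point: the queries start from fixed grid positions and the decoder only has to predict small offsets.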
Hi, I have a similar question. If VisionLLM uses a Deformable-DETR-like decoder and the object queries act as positional anchors, then Hungarian matching is required to assign GT boxes to object queries. However, the authors don't mention this in the paper. What do you think the training details of these object queries might be?
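For reference, the standard DETR-style Hungarian matching looks like the sketch below. Whether VisionLLM trains its queries this way is exactly the open question here; the L1 box cost (the real DETR cost also includes class and GIoU terms) and the toy boxes are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """Assign each GT box to one prediction by minimizing total L1 box distance."""
    # cost[i, j] = L1 distance between prediction i and GT j
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)      # optimal one-to-one assignment
    return pred_idx, gt_idx

# Toy example: 3 predictions (e.g. from 3 object queries), 2 GT boxes (cx, cy, w, h)
preds = np.array([[0.10, 0.10, 0.20, 0.20],
                  [0.80, 0.80, 0.10, 0.10],
                  [0.50, 0.50, 0.30, 0.30]])
gts = np.array([[0.82, 0.79, 0.10, 0.10],
                [0.12, 0.10, 0.20, 0.20]])
pred_idx, gt_idx = hungarian_match(preds, gts)          # unmatched queries get "no object"
```

Predictions left unmatched (here the third query) would be supervised toward the "no object" class, as in DETR.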
Same here.
https://github.com/OpenGVLab/VisionLLM/issues/7#issue-2120370842
Maybe the link below will be helpful:
https://openreview.net/forum?id=Vx1JadlOIt&noteId=616Bhd6O5S
Maybe the image below will be helpful too.
Hi, your work is great! But I am confused about the location tokens used in the decoder; could you provide more details about them?