Open YellowPig-zp opened 2 years ago
For the first question, you are right; it seems to be a bug in our implementation. For the second, we only use PE(xy) as the positional query, see this line, which slices PE(xywh) down to PE(xy). By the way, we use the projected PE(xywh) as the positional query for self-attention.
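A minimal sketch of what that slicing amounts to, assuming each coordinate contributes d_model // 2 sine channels and the x/y channels come first in the concatenation (slice_xy_pos_query is just an illustrative name, not a function from this repo):

```python
import torch

def slice_xy_pos_query(query_sine_embed: torch.Tensor, d_model: int = 256) -> torch.Tensor:
    """Keep only the PE(x, y) part of a PE(x, y, w, h) sine embedding.

    Assumes the embedding is a concatenation of per-coordinate sine embeddings,
    each of size d_model // 2, so the first d_model channels correspond to the
    x/y pair. Input: (num_queries, batch_size, 2 * d_model) for xywh boxes.
    """
    return query_sine_embed[..., :d_model]  # -> (num_queries, batch_size, d_model)
```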
Hi, I am also confused about the temperature for the positional embedding. Is the code available on GitHub different from the code used for the paper? Or did you use a temperature of 10000?
I think temp=20 is currently used only in the image position encoding; see main.py and position_encoding.py. I agree with @YellowPig-zp that temp=20 should also be applied to the box positions in the cross-attention, which are likewise in the range 0~1.
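A hedged sketch of how a configurable temperature could be threaded through the box sine embedding, so the same value (e.g. 20) used for the image position encoding can be reused for the normalized box coordinates (box_sine_embed and its defaults are illustrative, not the repo's actual gen_sineembed_for_position):

```python
import math
import torch

def box_sine_embed(coords: torch.Tensor, num_feats: int = 128,
                   temperature: float = 20.0, scale: float = 2 * math.pi) -> torch.Tensor:
    """Sine embedding for normalized box coordinates in [0, 1].

    coords: (num_queries, batch_size, k) with k = 2 (xy) or 4 (xywh).
    Returns (num_queries, batch_size, k * num_feats).
    The temperature is a free parameter here; the point is that the value
    used for the image position encoding could be applied to boxes as well.
    """
    dim_t = torch.arange(num_feats, dtype=torch.float32, device=coords.device)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    embeds = []
    for i in range(coords.shape[-1]):
        pos = coords[..., i] * scale                  # map [0, 1] coords to [0, 2*pi]
        pos = pos[..., None] / dim_t                  # (num_queries, batch_size, num_feats)
        pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1).flatten(-2)
        embeds.append(pos)
    return torch.cat(embeds, dim=-1)
```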
In the paper, there is a section saying that the optimal temperature for the positional embedding in your model is 20. However, this line in gen_sineembed_for_position indicates that a temperature of 10000 is used. Is there something I missed when reading the code?
Besides, the paper also says that only the x and y coordinates are used to generate the positional embedding for the cross-attention, but this line, despite being commented as num_queries x batch_size x 2, actually operates on a num_queries x batch_size x 4 tensor if you print its shape. Does this perform better than using only x and y, or are the two similar in performance?