This question was raised during our rebuttal period. We post our answer here for clarification. Thanks for your attention to our work :)
Our PRD learns the prompt embeddings implicitly. Although the learnable queries are not constrained to focus on specific semantics, it is interesting that the four query visualizations we picked out (of the $Q=16$ queries in Tab. 1) happen to align well with meaningful parts of the human body. We believe that introducing additional prior knowledge or constraints into the PRD could further help it learn human semantics; we leave this to future work.
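As a rough illustration (not our exact PRD implementation), the general mechanism is a DETR-style decoder in which a set of learnable queries cross-attends to image patch features; each query specializes only implicitly through the training objective. The module name, query count, layer count, and dimensions below are placeholders for the sketch:

```python
import torch
import torch.nn as nn

class PromptDecoderSketch(nn.Module):
    """Minimal sketch of learnable queries cross-attending to image features.

    This is an illustrative assumption of the general DETR-style mechanism,
    not the actual PRD code. Nothing here forces a query to attend to a
    specific body part; any specialization emerges during training.
    """
    def __init__(self, num_queries=16, dim=768, num_layers=2, num_heads=8):
        super().__init__()
        # Q learnable query embeddings (the implicit "prompts")
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, patch_features):
        # patch_features: [batch_size, num_patches, dim] from an image encoder
        b = patch_features.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)  # [B, Q, dim]
        # Each query cross-attends to all patch tokens and aggregates whatever
        # evidence lowers the downstream training loss (e.g. a ReID objective).
        return self.decoder(tgt=queries, memory=patch_features)  # [B, Q, dim]

if __name__ == "__main__":
    model = PromptDecoderSketch(num_queries=16, dim=768)
    feats = torch.randn(2, 196, 768)  # dummy ViT-like patch features
    print(model(feats).shape)         # torch.Size([2, 16, 768])
```

The key point of the sketch is that the query-to-semantic mapping is not hard-coded anywhere; it is only observed post hoc through attention visualizations.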
Hi, I see you wrote in your paper: "By revisiting how people perceive a person image, we find several common characteristics, i.e., human body parts, age, gender, hairstyle, clothing, and so on, as demonstrated in Fig. 1(a)." But how do you make sure that these transformer blocks extract the right information that you want and not something else? The shape of the hidden_states that the decoder outputs is [batch_size, 8, 768]. I want to know how these eight kinds of information are decoupled from other, irrelevant information.
Thanks very much!