encounter1997 / FP-DETR

Official Implementation of "FP-DETR: Detection Transformer Advanced by Fully Pre-training"
Apache License 2.0

Question Regarding Figure 1 from Paper #2

Closed michaelku1 closed 2 years ago

michaelku1 commented 2 years ago

[screenshot of Figure 1 from the paper]

Thanks for your contribution to DETR pre-training. I have a few questions about Figure 1 from the paper. At the fine-tuning stage, as indicated in the figure, the left part is the input patches and the right part is the query content embeddings + query positional embeddings. If the query positional embeddings act as visual prompts in this case, what exactly do you mean by "query content embeddings", and where can I find them in the figure?

My other question is: given the assumption that identifying object regions can be understood as visual prompting, how does adding a task adaptor (which models object relations) help achieve better visual prompting? Is there any intuition beyond the fact that it models object relations better? Thanks.

encounter1997 commented 2 years ago

Hi, thanks for your interest. (1) The right part is indeed the query content embeddings + query positional embeddings (visual prompt). We omitted the layer-wise adding in the figure for clarity. (2) The task adaptor operates on the combination of the visual prompt and the query content embeddings, enabling both spatial and semantic information exchange, which should help the model better identify object locations. For reference, I think "Relation Networks for Object Detection" may share a similar intuition.
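To make the layer-wise adding concrete, here is a minimal PyTorch sketch of how query content embeddings and query positional embeddings (the visual prompt) could be combined before a self-attention-style task adaptor. This is only an illustration of the idea, not the FP-DETR code; the class name `TaskAdaptorSketch` and all sizes are assumptions.

```python
import torch.nn as nn

class TaskAdaptorSketch(nn.Module):
    """Illustrative sketch (not the FP-DETR implementation): queries exchange
    spatial and semantic information via self-attention."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, content, prompt):
        # Layer-wise adding: the query positional embedding (visual prompt)
        # is added to the query content embedding inside every layer,
        # rather than once at the input.
        q = k = content + prompt
        attended, _ = self.self_attn(q, k, content)
        return self.norm(content + attended)

# Assumed sizes, for illustration only.
num_queries, d_model, batch = 300, 256, 2
content_embed = nn.Embedding(num_queries, d_model)  # query content embeddings
prompt_embed = nn.Embedding(num_queries, d_model)   # query positional embeddings (visual prompt)

content = content_embed.weight.unsqueeze(0).repeat(batch, 1, 1)
prompt = prompt_embed.weight.unsqueeze(0).repeat(batch, 1, 1)
out = TaskAdaptorSketch(d_model)(content, prompt)    # (batch, num_queries, d_model)
```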

michaelku1 commented 2 years ago

Thank you. This is very helpful.

michaelku1 commented 2 years ago

Hello again, there is something I would like to ask to clear up the rest of my doubts. I have skimmed through the prompt-based learning survey paper to understand what textual prompting does, and I'd just like to make sure that the similarities drawn between textual and visual prompts are correct and consistent:

For the visual prompt:

For your reference, the following table is the one extracted from the "Pre-train, Prompt, and Predict" survey paper:

[screenshot of the table from the survey paper]

I mostly work with vision and am no expert in NLP, so I'd just like to make sure I understand the usage correctly. Many thanks!

encounter1997 commented 2 years ago

Hello, I think your understanding is correct. Note that we only draw an analogy with textual prompts in NLP, trying to understand object queries from another perspective. However, our model still requires full-network fine-tuning.

It would be interesting to fix the backbone and tune only the visual prompt-related parameters, as discussed in Sec. 3.3, though our initial experiments showed poor model performance. That said, recent works such as Visual Prompt Tuning and Vision Transformer Adapter have made further progress in this direction, if you are interested.
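For anyone who wants to try that setting, below is a rough sketch (under assumptions, not code from this repo) of freezing a pre-trained detector and updating only the prompt-related parameters; the name filter `query_embed` is a guess and would need to match the actual parameter names in the model.

```python
def setup_prompt_tuning(model, trainable_keywords=("query_embed",)):
    """Hypothetical sketch: freeze all parameters except those whose names
    contain one of the (assumed) prompt-related keywords."""
    trainable = []
    for name, param in model.named_parameters():
        if any(key in name for key in trainable_keywords):
            param.requires_grad = True
            trainable.append(name)
        else:
            param.requires_grad = False
    return trainable  # names of the parameters that will be updated

# Usage (assuming `model` is an already-built detector):
# trainable_names = setup_prompt_tuning(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```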