fuweifu-vtoo closed 5 months ago
Hi @fuweifu-vtoo, each visual prompt embedding can only come from one category.
For instance, if we consider a batch size of 2:
For each category in the first image, we randomly select between 1 and N instances (N being the number of instances of that category in the image) to form the visual prompt embedding. So if the first image contains categories A, B, and C, we will have three visual prompt embeddings, one each for A, B, and C.
Similarly, for the second image, we will have three visual prompt embeddings corresponding to categories D, E, and F.
In symbolic form:
For Image 1:
For Image 2:
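The per-image sampling described above can be sketched roughly as follows. This is only an illustration, not the T-Rex2 training code: the function name `sample_visual_prompts`, the `(category, box)` annotation format, and the example boxes are all assumptions made for the sketch.

```python
import random
from collections import defaultdict


def sample_visual_prompts(annotations, rng=random):
    """Pick 1..N boxes per category for a single image.

    `annotations` is a list of (category, box) pairs (hypothetical format).
    Returns {category: [boxes]}; every entry holds boxes of one category only.
    """
    boxes_by_category = defaultdict(list)
    for category, box in annotations:
        boxes_by_category[category].append(box)

    prompts = {}
    for category, boxes in boxes_by_category.items():
        k = rng.randint(1, len(boxes))  # inclusive on both ends: 1..N
        prompts[category] = rng.sample(boxes, k)
    return prompts


# Batch size of 2: image 1 has categories A, B, C; image 2 has D, E, F.
batch = [
    [("A", (10, 10, 50, 50)), ("A", (60, 20, 90, 70)),
     ("B", (5, 5, 30, 30)), ("C", (40, 40, 80, 80))],
    [("D", (0, 0, 20, 20)), ("E", (15, 15, 45, 45)),
     ("E", (50, 50, 70, 70)), ("F", (25, 60, 55, 95))],
]
prompts_per_image = [sample_visual_prompts(image) for image in batch]
```

Each image ends up with one group of boxes per category, and each group would later be encoded into a single visual prompt embedding.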
Got it. Thanks.
@Mountchicken Are the embeddings of the instances from the same class averaged to obtain the final embedding?
@lyf6 If different instances of the same category are within a single image, we directly use the aggregator token for aggregation. If they are from different images, we calculate the average to obtain the final embedding.
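For the cross-image case, the averaging step might look like the sketch below. It assumes each image has already produced one embedding for the category (for example via the aggregator token); `average_embeddings` and the toy 3-dimensional vectors are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def average_embeddings(embeddings):
    """Element-wise mean of same-length embedding vectors (lists of floats)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(values) / len(embeddings) for values in zip(*embeddings)]


# One embedding of the same category from each of two different images.
per_image_embeddings = [
    [1.0, 0.0, 2.0],  # from image 1
    [3.0, 2.0, 0.0],  # from image 2
]
final_embedding = average_embeddings(per_image_embeddings)
# final_embedding == [2.0, 1.0, 1.0]
```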
Thanks for your reply. I mean, if there are 2 bboxes for class A and 3 bboxes for class B, how is the final visual prompt embedding obtained during training? Do you randomly choose one, or average them after self-attention?
For class A and B, we will randomly select 1 to 2, and 1 to 3 boxes as their visual prompt, respectively. The selected boxes will go through the visual prompt encoder with multiple layers of self-attention and deformable-attention, and we use the last token (aggregator token) as the final visual prompt embedding.
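A minimal sketch of the "last token as aggregator" idea is given below. The encoder here is a deliberately crude stand-in (it replaces every token with the mean of the sequence), not the self-attention and deformable-attention layers described above; `encode_visual_prompt`, `toy_encoder`, and the 2-dimensional tokens are assumptions for illustration only.

```python
def encode_visual_prompt(box_tokens, aggregator_token, encoder):
    """Append the aggregator token, run the encoder, return the last token."""
    sequence = list(box_tokens) + [aggregator_token]
    encoded = encoder(sequence)
    return encoded[-1]  # the aggregator token's output is the final embedding


def toy_encoder(sequence):
    """Stand-in encoder: every output token is the mean of all input tokens.

    This only mimics tokens mixing information with each other; it is NOT
    the actual visual prompt encoder.
    """
    n = len(sequence)
    mean = [sum(values) / n for values in zip(*sequence)]
    return [list(mean) for _ in sequence]


# Two selected boxes of class A, already embedded as 2-d tokens (toy values).
box_tokens = [[1.0, 2.0], [3.0, 4.0]]
aggregator_token = [2.0, 0.0]  # would be a learnable token in a real model
embedding = encode_visual_prompt(box_tokens, aggregator_token, toy_encoder)
# embedding == [2.0, 2.0]
```

The point of the sketch is the data flow: however many boxes are selected for a class, only the aggregator token's output is kept, so each class yields exactly one embedding.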
i got it, thanks very much
Sorry, I have another question: is it necessary to sample negative bboxes and add them to the visual prompt during training?
Sampling negative examples can effectively mitigate the model's hallucination issue (i.e., the model not following your visual prompt and instead detecting more prominent areas in the image). Contrastive learning between positive and negative examples helps the model better distinguish the visual prompt.
So can you tell me how you sample the negative bboxes?
In your paper, you mentioned: we randomly choose between one to all available GT boxes to use as visual prompts.
Could the visual prompts selected here be from different categories?
Or do the visual prompts of Trex2 have to come from the same category?