IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

About the visual prompt #66

Closed fuweifu-vtoo closed 5 months ago

fuweifu-vtoo commented 5 months ago

In your paper, you mentioned: "we randomly choose between one to all available GT boxes to use as visual prompts."

Could the visual prompts selected here be from different categories?

Or do the visual prompts of T-Rex2 have to come from the same category?

Mountchicken commented 5 months ago

Hi @fuweifu-vtoo. Each visual prompt embedding can only come from one category.

For instance, if we consider a batch size of 2:

Suppose the first image contains categories A, B, and C. For each category, we randomly select between 1 and N instances to form that category's visual prompt embedding. Therefore, for the first image, we will have three visual prompt embeddings, corresponding to categories A, B, and C.

Similarly, if the second image contains categories D, E, and F, we will have three visual prompt embeddings for it, corresponding to those categories.

In symbolic form: image 1 → {V_A, V_B, V_C} and image 2 → {V_D, V_E, V_F}, where V_X denotes the visual prompt embedding for category X.
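
A minimal sketch of that sampling step, assuming a dict of per-category GT boxes and some box-to-embedding encoder (`gt_boxes_by_category` and `encode_boxes` are illustrative names, not the actual T-Rex2 API):

```python
import random
import torch

def sample_visual_prompts(gt_boxes_by_category, encode_boxes):
    """For each category in one image, randomly pick 1 to N of its GT boxes
    and encode them into a single visual prompt embedding."""
    prompt_embeddings = {}
    for category, boxes in gt_boxes_by_category.items():
        k = random.randint(1, boxes.shape[0])              # 1 up to all available boxes
        chosen = boxes[torch.randperm(boxes.shape[0])[:k]]  # random subset of GT boxes
        prompt_embeddings[category] = encode_boxes(chosen)  # one embedding per category
    return prompt_embeddings

# image 1 (categories A, B, C) -> {A: V_A, B: V_B, C: V_C}
# image 2 (categories D, E, F) -> {D: V_D, E: V_E, F: V_F}
```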

fuweifu-vtoo commented 5 months ago

Got it. Thanks.

lyf6 commented 2 months ago

@Mountchicken Are the embeddings for instances from the same class averaged to obtain the final embedding?

Mountchicken commented 2 months ago

@lyf6 If different instances of the same category are within a single image, we directly use the aggregator token for aggregation. If they are from different images, we calculate the average to obtain the final embedding.
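
A small illustration of the cross-image case, assuming each image's visual prompt encoder has already produced one aggregated embedding for the category (a hypothetical sketch, not the repository's code):

```python
import torch

# Suppose category A appears in 3 different images of the batch; each image's
# visual prompt encoder already produced one aggregated embedding for A
# via its aggregator token (within-image aggregation).
per_image_embeddings = [torch.randn(256) for _ in range(3)]  # e.g. embedding dim 256

# Across images, the per-image embeddings are simply averaged to obtain
# the final visual prompt embedding for category A.
final_embedding_A = torch.stack(per_image_embeddings).mean(dim=0)
```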

lyf6 commented 2 months ago

Thanks for your reply. I mean, if there are 2 bboxes for class A and 3 bboxes for class B, how do you obtain the final visual prompt embedding during training? Do you randomly choose one, or average them after self-attention?

Mountchicken commented 2 months ago

For classes A and B, we will randomly select 1 to 2 and 1 to 3 boxes as their visual prompts, respectively. The selected boxes go through the visual prompt encoder, which has multiple layers of self-attention and deformable attention, and we use the last token (the aggregator token) as the final visual prompt embedding.
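
A rough sketch of that aggregation step, assuming the selected box tokens are concatenated with a learnable aggregator token and passed through self-attention; `VisualPromptEncoder` here is hypothetical and omits the deformable attention over image features:

```python
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Hypothetical sketch: encodes K selected box tokens into one prompt embedding."""

    def __init__(self, dim=256, num_layers=3, num_heads=8):
        super().__init__()
        self.aggregator_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, num_layers)

    def forward(self, box_tokens):
        # box_tokens: (B, K, dim) embeddings of the K selected boxes.
        agg = self.aggregator_token.expand(box_tokens.size(0), -1, -1)
        tokens = torch.cat([box_tokens, agg], dim=1)   # append aggregator token
        tokens = self.self_attn(tokens)
        return tokens[:, -1]                           # last token = final prompt embedding
```

For example, `VisualPromptEncoder()(torch.randn(2, 3, 256))` would return a `(2, 256)` tensor, i.e. one prompt embedding per batch item.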

lyf6 commented 2 months ago

I got it, thanks very much.

lyf6 commented 2 months ago

I'm sorry, I have another question: is it necessary to sample negative bboxes to add to the visual prompts during training?

Mountchicken commented 2 months ago

Sampling negative examples can effectively mitigate the model's hallucination issue (i.e., the model not following your visual prompt and instead detecting more prominent areas in the image). Contrastive learning between positive and negative examples helps the model better distinguish the visual prompt.
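
One generic way to write such a positive/negative contrast is an InfoNCE-style loss in which an object query should score higher against its own prompt embedding than against prompts from other categories. This is only an illustrative sketch, not necessarily the exact loss or negative-sampling scheme used in T-Rex2 (how the negatives are actually sampled is exactly the question below):

```python
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(query_embed, pos_prompt, neg_prompts, temperature=0.07):
    """Generic InfoNCE-style contrast: a matched object query should align with
    its own (positive) visual prompt rather than with negative prompts.

    query_embed: (dim,)    embedding of a matched object query
    pos_prompt:  (dim,)    visual prompt embedding of the query's category
    neg_prompts: (M, dim)  visual prompt embeddings from other categories
    """
    prompts = torch.cat([pos_prompt.unsqueeze(0), neg_prompts], dim=0)        # (1+M, dim)
    logits = F.normalize(query_embed, dim=0) @ F.normalize(prompts, dim=1).T  # (1+M,)
    logits = logits / temperature
    target = torch.tensor([0])                       # index 0 = the positive prompt
    return F.cross_entropy(logits.unsqueeze(0), target)
```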

lyf6 commented 2 months ago

So can you tell me how you sample the negative bboxes?