IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/home

About multiple prompts #75

Open thfylsty opened 5 days ago

thfylsty commented 5 days ago

What is the multi-box prompt strategy?

Is it directly computing the mean of multiple prompts? I suspect it might simply be averaging them. In the "Generic Visual Prompt Workflow" section of the paper, it states: Let V1, V2, ..., Vn represent the visual embeddings obtained from n different images; the generic visual embeddings V are computed as the mean of these embeddings.
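In symbols, I take that passage to mean:

```math
V = \frac{1}{n} \sum_{i=1}^{n} V_i
```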

Or is it aggregated by a Transformer? In the "Visual Prompt Encoder" section of the paper, it states: Q = Linear(CAT([C; C′], [B; B′]); φ_B). Additionally, a universal class token C′ ∈ R^(1×D) is utilized to aggregate features from other visual prompts, accommodating the scenario where users might supply multiple visual prompts within a single image.
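If I read that equation literally, the query construction might look roughly like this PyTorch sketch (the shapes, the zero position slot for the class token, and `phi_B` as a linear layer over the concatenated features are all my guesses, not the actual implementation):

```python
import torch
import torch.nn as nn

D = 256       # embedding dim (assumed)
n_boxes = 3   # box prompts the user drew in one image

# C: content embeddings for each box prompt; c_uni: universal class token C'
C = torch.randn(n_boxes, D)
c_uni = nn.Parameter(torch.randn(1, D))
# B: box position embeddings; b_uni: position slot for the class token (assumed zero)
B = torch.randn(n_boxes, D)
b_uni = torch.zeros(1, D)

# Q = Linear(CAT([C; C'], [B; B']); phi_B)
content = torch.cat([C, c_uni], dim=0)    # (n_boxes + 1, D)
position = torch.cat([B, b_uni], dim=0)   # (n_boxes + 1, D)
phi_B = nn.Linear(2 * D, D)               # my guess at what phi_B projects
Q = phi_B(torch.cat([content, position], dim=-1))

# the universal-token row of Q would then attend over the box rows
# (e.g. via self-attention) and become the single prompt embedding
print(Q.shape)  # torch.Size([4, 256])
```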

One method is aggregation and the other is averaging. Is averaging the method used across images? And what strategy is used when multiple prompts are given within a single image?

I look forward to your reply, thank you.

Mountchicken commented 3 days ago

Hi @thfylsty Sorry for the late reply. Let's say we have two images, A and B. A has three boxes for the category "dog" and B has four boxes for the same category. In image A, we use a transformer to aggregate the three dog instances into one visual prompt embedding, denoted dog1. In image B, we likewise use the transformer for aggregation and denote the resulting embedding dog2. Now, if we want a generic dog embedding as in the "Generic Visual Prompt Workflow", we take the average, dog = (dog1 + dog2) / 2.
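A minimal sketch of that two-stage strategy (the `PromptAggregator` below is only a stand-in for the actual visual prompt encoder; the class-token-attends-over-boxes design and all dimensions are illustrative):

```python
import torch
import torch.nn as nn

D = 256  # embedding dim (assumed)

class PromptAggregator(nn.Module):
    """Stand-in for the visual prompt encoder: a universal class
    token attends over the per-box embeddings of one image."""
    def __init__(self, dim=D, heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))  # universal class token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, box_embeds):  # box_embeds: (1, n_boxes, D)
        out, _ = self.attn(self.cls, box_embeds, box_embeds)
        return out.squeeze(0).squeeze(0)  # (D,): one embedding per image

agg = PromptAggregator()

# Image A: three "dog" boxes -> one per-image embedding, dog1
dog1 = agg(torch.randn(1, 3, D))
# Image B: four "dog" boxes -> one per-image embedding, dog2
dog2 = agg(torch.randn(1, 4, D))

# Generic Visual Prompt Workflow: average across images
dog = (dog1 + dog2) / 2
# equivalently, for n images: torch.stack([dog1, dog2]).mean(dim=0)
```

So, within an image the aggregation is learned (transformer), and across images it is a simple mean of the per-image embeddings.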