Open thfylsty opened 5 days ago
Hi @thfylsty Sorry for the late reply. Let's say we have two images A and B. A has three boxes for category "dog" and B has four boxes for category "dog". In image A, we use a transformer to aggregate the three dog instances into one visual prompt embedding, denoted as dog1. In image B, we likewise use the transformer for aggregation and denote the resulting visual prompt embedding as dog2. Now if we want a general dog embedding, as in the "Generic Visual Prompt Workflow", we can use the average (dog = (dog1 + dog2) / 2) to represent it.
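The cross-image averaging described above can be sketched as follows. This is a minimal illustration, not the repository's actual code; the embedding dimension and the example values are made up, and `generic_embedding` is a hypothetical helper name.

```python
import numpy as np

D = 4  # assumed embedding dimension, for illustration only
dog1 = np.array([1.0, 2.0, 3.0, 4.0])  # stands in for image A's aggregated embedding
dog2 = np.array([3.0, 2.0, 1.0, 0.0])  # stands in for image B's aggregated embedding

# Generic visual prompt embedding: element-wise mean of the per-image embeddings.
dog = (dog1 + dog2) / 2

def generic_embedding(embeddings):
    """Mean of n per-image visual prompt embeddings V1, ..., Vn."""
    return np.mean(np.stack(embeddings), axis=0)
```

For n images the same helper applies unchanged: `generic_embedding([V1, V2, ..., Vn])`.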
What is the multi-box prompt strategy?
Is it directly computing the mean of multiple prompts? I found that it might simply be the mean of the multiple prompt embeddings. In the section "Generic Visual Prompt Workflow" of the paper, it states: Let V1, V2, ..., Vn represent the visual embeddings obtained from n different images; the generic visual embedding V is computed as the mean of these embeddings.
Is it aggregated by a Transformer? In the section "Visual Prompt Encoder" of the paper, it states: Q = Linear(CAT([C; C′], [B; B′]); φB). Additionally, a universal class token C′ ∈ R^(1×D) is utilized to aggregate features from other visual prompts, accommodating the scenario where users might supply multiple visual prompts within a single image.
One method is Transformer aggregation and the other is taking the mean. Is the mean only suitable across images? And what strategy is used when multiple prompts appear within a single image?
I look forward to your reply, thank you.