Hi @thfylsty, sorry for the late reply. Let's say we have two images, A and B. A has three boxes for the category "dog" and B has four boxes for the category "dog". In image A, we use the transformer to aggregate the three dog instances into one visual prompt embedding, denoted dog1. In image B, we also use the transformer for aggregation and denote the resulting visual prompt embedding dog2. If we then want a general dog embedding, as in the "Generic Visual Prompt Workflow", we can use the average, dog = (dog1 + dog2) / 2, to represent it.
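As a minimal sketch of the averaging step described above (not the repository's actual code): the function name `generic_embedding` and the toy 4-dimensional vectors are assumptions made for illustration, and `dog1` / `dog2` stand in for the per-image embeddings already produced by the transformer aggregation.

```python
import torch

def generic_embedding(per_image_embeddings):
    """Average the per-image visual prompt embeddings of one category.

    `per_image_embeddings` is a list of (D,) tensors, one per image.
    This is a hypothetical helper, not part of any released API.
    """
    stacked = torch.stack(per_image_embeddings, dim=0)  # (n_images, D)
    return stacked.mean(dim=0)                          # (D,)

# Toy stand-ins for the aggregated embeddings from images A and B.
dog1 = torch.tensor([1.0, 0.0, 2.0, 4.0])
dog2 = torch.tensor([3.0, 2.0, 0.0, 4.0])

dog = generic_embedding([dog1, dog2])  # same as (dog1 + dog2) / 2
```

The same helper works for any number of images, which matches the "mean of V1, ..., Vn" wording quoted from the paper further down this thread.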
@Mountchicken, could you please help me understand how this works when we train the model? Suppose we have two images in a batch (A and B, as in your example). Image A contains 3 dogs and image B contains 4 dogs (both images also contain some cats). From image A we obtain the embedding "dog1", and from image B we obtain the embedding "dog2".
Should we compute a mean embedding "mean_dog", or should we compute the similarity between all object embeddings from image A (and likewise image B) and ["dog1", "dog2"]?
1) F.cosine_similarity(imA/imB objects embeddings, "mean_dog")
or
2) F.cosine_similarity(imA/imB objects embeddings, Tensor(["dog1", "dog2"]))
?
Thank you for your help!
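For concreteness, here is a rough sketch of what the two options in the question would look like as tensor operations. It does not say which one the training code uses; the tensor names, the embedding size `D = 8`, and the count of 7 candidate objects are all made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 8                                   # hypothetical embedding size
object_embeddings = torch.randn(7, D)   # hypothetical: 7 object embeddings from images A and B
dog1 = torch.randn(D)                   # prompt embedding from image A
dog2 = torch.randn(D)                   # prompt embedding from image B

# Option 1: compare every object with a single averaged prompt.
mean_dog = (dog1 + dog2) / 2
sim_mean = F.cosine_similarity(object_embeddings, mean_dog.unsqueeze(0), dim=-1)  # (7,)

# Option 2: compare every object with each per-image prompt separately.
prompts = torch.stack([dog1, dog2], dim=0)  # (2, D)
sim_each = F.cosine_similarity(
    object_embeddings.unsqueeze(1),  # (7, 1, D)
    prompts.unsqueeze(0),            # (1, 2, D)
    dim=-1,
)                                    # (7, 2)
```

Option 1 yields one score per object, while option 2 yields one score per object and per prompt, so the two choices also differ in the shape of the similarity targets needed during training.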
What is the multi-box prompt strategy?
Is it directly computing the mean of multiple prompts? It seems it might be. In the section "Generic Visual Prompt Workflow", the paper states: Let V1, V2, ..., Vn represent the visual embeddings obtained from n different images, the generic visual embeddings V are computed as the mean of these embeddings
Is it aggregation by the Transformer? In the section "Visual Prompt Encoder", the paper states: Q = Linear(CAT([C; C′], [B; B′]); φB), box. Additionally, a universal class token C′ ∈ R^(1×D) is utilized to aggregate features from other visual prompts, accommodating the scenario where users might supply multiple visual prompts within a single image.
One method is aggregation and the other is averaging. Is averaging meant for prompts across images? If so, which strategy is used when multiple prompts are given within a single image?
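To make the quoted formula easier to picture, here is a loose sketch of how a class token could pool several box prompts from one image into a single embedding. It is only an illustration under several assumptions: the class name `PromptQuerySketch`, the plain `nn.Linear` used in place of a box position encoding, the learnable position for the class token, and the single standard self-attention layer are all stand-ins, and attention to image features is omitted entirely.

```python
import torch
import torch.nn as nn

class PromptQuerySketch(nn.Module):
    """Hypothetical sketch of Q = Linear(CAT([C; C'], [B; B'])) plus pooling."""

    def __init__(self, dim=16, num_heads=4):
        super().__init__()
        self.content = nn.Parameter(torch.zeros(1, dim))      # C, shared by every box
        self.class_token = nn.Parameter(torch.zeros(1, dim))  # C', the universal class token
        self.class_pos = nn.Parameter(torch.zeros(1, dim))    # B' (assumed learnable here)
        self.box_proj = nn.Linear(4, dim)                      # stand-in for the box encoding B
        self.fuse = nn.Linear(2 * dim, dim)                    # the Linear(...) in the formula
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, boxes):
        # boxes: (K, 4) box coordinates for one category in one image
        k = boxes.shape[0]
        content = torch.cat([self.content.expand(k, -1), self.class_token], dim=0)  # (K+1, D)
        position = torch.cat([self.box_proj(boxes), self.class_pos], dim=0)         # (K+1, D)
        q = self.fuse(torch.cat([content, position], dim=-1)).unsqueeze(0)          # (1, K+1, D)
        out, _ = self.attn(q, q, q)  # the class token attends to all box prompts
        return out[0, -1]            # (D,), read out at the class-token position

model = PromptQuerySketch()
dog1 = model(torch.rand(3, 4))  # three dog boxes in image A -> one embedding
dog2 = model(torch.rand(4, 4))  # four dog boxes in image B -> one embedding
```

In this reading, the number of boxes per image can vary, but each image still yields exactly one embedding per category, taken from the class-token slot.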
I look forward to your reply, thank you.