IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

about multiple prompt #75

Closed · thfylsty closed this issue 3 months ago

thfylsty commented 4 months ago

What is the multi-box prompt strategy?

Is it simply the mean of multiple prompts? In the "Generic Visual Prompt Workflow" section of the paper, it states: let V1, V2, ..., Vn represent the visual embeddings obtained from n different images; the generic visual embedding V is computed as the mean of these embeddings.

Or is it aggregation by a transformer? In the "Visual Prompt Encoder" section of the paper, it states: Q = Linear(CAT([C; C′], [B; B′]); φ_B). Additionally, a universal class token C′ ∈ R^(1×D) is utilized to aggregate features from other visual prompts, accommodating the scenario where users might supply multiple visual prompts within a single image.

So one method is transformer aggregation and the other is averaging. Is averaging what is used across images? And what strategy is used when multiple prompts are given within a single image?

I look forward to your reply, thank you.

Mountchicken commented 4 months ago

Hi @thfylsty Sorry for the late reply. Let's say we have two images, A and B. A has three boxes for the category "dog" and B has four boxes for "dog". In image A, we use a transformer to aggregate the three dog instances into one visual prompt embedding, denoted dog1. In image B, we likewise use the transformer for aggregation and denote the resulting embedding dog2. Now, if we want a generic dog embedding as in the "Generic Visual Prompt Workflow", we take the average (dog = (dog1 + dog2) / 2) to represent it.
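
For concreteness, here is a minimal sketch of that two-stage scheme: a transformer with a universal class token aggregates the boxes within each image, then a plain average is taken across images. The module name, embedding dimension, and transformer configuration below are illustrative assumptions, not the actual T-Rex2 implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed names/dims, not the actual T-Rex2 code):
# step 1 -- aggregate all boxes of one category within one image using a
#           transformer and a universal class token C';
# step 2 -- average the resulting per-image embeddings across images.
D = 256  # embedding dimension (assumption)

class BoxPromptAggregator(nn.Module):
    """Aggregates the box embeddings of a single image into one prompt embedding."""
    def __init__(self, dim: int = D, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        self.class_token = nn.Parameter(torch.zeros(1, dim))  # universal class token C'
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, box_embeddings: torch.Tensor) -> torch.Tensor:
        # box_embeddings: (num_boxes, dim), e.g. the 3 dog boxes of image A
        tokens = torch.cat([self.class_token, box_embeddings], dim=0)  # prepend C'
        tokens = self.encoder(tokens.unsqueeze(0)).squeeze(0)
        return tokens[0]  # C' now summarizes all boxes of this image

aggregator = BoxPromptAggregator()
dog1 = aggregator(torch.randn(3, D))  # image A: 3 dog boxes -> one embedding
dog2 = aggregator(torch.randn(4, D))  # image B: 4 dog boxes -> one embedding

# Generic visual prompt across images: plain mean of the per-image embeddings.
dog = torch.stack([dog1, dog2]).mean(dim=0)  # dog = (dog1 + dog2) / 2
```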

VilisovEvgeny commented 3 months ago

@Mountchicken, could you please help me understand how this works during training? Suppose we have 2 images in a batch (as in your example, A and B). Image A contains 3 dogs and image B contains 4 dogs (both images also contain some cats). From image A we obtain the embedding "dog1", and from image B we obtain "dog2".

Should we compute a mean embedding "mean_dog", or compute the similarity between all object embeddings from images A/B and both of ["dog1", "dog2"]? That is (see the sketch below):

1. F.cosine_similarity(image A/B object embeddings, "mean_dog"), or
2. F.cosine_similarity(image A/B object embeddings, Tensor(["dog1", "dog2"]))?
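
A minimal illustrative sketch of the two options; the tensor names, shapes, and random values are assumptions for the sake of the question, not the T-Rex2 training code:

```python
import torch
import torch.nn.functional as F

# Illustrative only -- random tensors standing in for the embeddings in question.
D = 256
object_embeddings = torch.randn(7, D)  # all object embeddings from images A and B
mean_dog = torch.randn(D)              # option 1: averaged "dog" prompt
dog_prompts = torch.randn(2, D)        # option 2: per-image prompts [dog1, dog2]

# Option 1: one similarity score per object against the single averaged prompt -> (7,)
scores_mean = F.cosine_similarity(object_embeddings, mean_dog.unsqueeze(0), dim=-1)

# Option 2: one score per object and per per-image prompt -> (7, 2)
scores_per_prompt = F.cosine_similarity(
    object_embeddings.unsqueeze(1),  # (7, 1, D)
    dog_prompts.unsqueeze(0),        # (1, 2, D)
    dim=-1,
)
```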

Thank you for your help!