K set of InfoNCE Loss at Region-Level Contrastive Alignment

IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

https://deepdataspace.com/blog/T-Rex

K set of InfoNCE Loss at Region-Level Contrastive Alignment #50

Closed: urbaneman closed this issue 7 months ago

urbaneman commented 7 months ago

What value do you set K to in L_align?

Is only one visual prompt chosen for each distinct category? Is this correct?

Mountchicken commented 7 months ago

Hi @urbaneman, K is the number of categories in each image. Given an image with three categories A, B, and C, we extract three visual prompt embeddings, one for each category.
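To make the role of K concrete, here is a minimal sketch of an InfoNCE-style alignment loss over the K categories of one image. It assumes the standard InfoNCE form, where the visual prompt embedding of category i is pulled toward the text embedding of the same category and pushed away from the other K - 1 text embeddings. The function name `info_nce`, the `temperature` parameter, and the plain-list embedding format are illustrative assumptions, not code from the T-Rex2 repository.

```python
import math


def info_nce(visual, text, temperature=1.0):
    """Mean InfoNCE loss over the K categories of one image.

    visual[i] and text[i] are the embeddings (lists of floats) of category i.
    The temperature is a hypothetical knob; 1.0 means plain dot products.
    """
    k = len(visual)
    if k == 0 or k != len(text):
        raise ValueError("need one visual and one text embedding per category")

    total = 0.0
    for i in range(k):
        # Similarity of visual prompt i to the text embedding of every category.
        logits = [
            sum(a * b for a, b in zip(visual[i], t)) / temperature for t in text
        ]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log( exp(logits[i]) / sum_j exp(logits[j]) )
        total += log_denom - logits[i]
    return total / k


# Three categories A, B, C in one image, so K = 3.
visual_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss = info_nce(visual_prompts, text_prompts)
```

With K = 3, each visual prompt embedding is contrasted against exactly three text embeddings, so the size of the softmax denominator varies from image to image with the number of categories present.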

urbaneman commented 7 months ago

Got it, that matches my understanding. This paper seems to have involved a very large amount of work. Thanks for your great work.

lyf6 commented 3 months ago

When there are multiple GT bboxes for a category, do you select one randomly?

Mountchicken commented 3 months ago

@lyf6 For each category, we randomly select between 1 and N GT boxes as visual prompts, where N = len(boxes).
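The sampling described above could look roughly like the following sketch. The function name `sample_prompt_boxes`, the dict-of-lists input, and the `(x1, y1, x2, y2)` box format are hypothetical, and drawing the count uniformly from 1..N is an assumption; the reply only says that 1 to N boxes are chosen at random. How the selected boxes are then encoded into a prompt embedding is outside this sketch.

```python
import random


def sample_prompt_boxes(gt_boxes_by_category, rng=random):
    """For each category, keep a random subset of 1..N of its GT boxes."""
    prompts = {}
    for category, boxes in gt_boxes_by_category.items():
        n = len(boxes)
        if n == 0:
            continue  # no GT boxes, so no visual prompt for this category
        count = rng.randint(1, n)  # inclusive on both ends (assumed uniform)
        prompts[category] = rng.sample(boxes, count)  # without replacement
    return prompts


# Hypothetical GT boxes in (x1, y1, x2, y2) format.
gt = {
    "A": [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)],
    "B": [(1, 1, 2, 2)],
}
selected = sample_prompt_boxes(gt, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the selection reproducible; omitting `rng` falls back to the module-level generator.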

lyf6 commented 3 months ago

Thank you for your reply. When 1 to N GT boxes are selected as visual prompts, will the InfoNCE loss differ from the version in your paper?