IDEA-Research / T-Rex

API for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/home

contrastive embedding #59

Closed: lime-s closed this issue 1 month ago

lime-s commented 1 month ago

In your paper, after features are extracted by the Visual Prompt Encoder and the Image Encoder, the model computes the similarity between the encoder features and the prompt embeddings. Are the prompt embeddings here the V computed in the Visual Prompt Encoder? In the Visual Prompt Encoder section of the paper you wrote V = FFN(SelfAttn(q'))[-1]. Is this equal to C' + B'? And what is its size?
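For readers following the notation: the `[-1]` in V = FFN(SelfAttn(q'))[-1] keeps only the output at the last sequence position. Assuming the last position holds the token built from C' and B', V would be that token's output after attention and the FFN, not the raw sum C' + B'. Below is a minimal pure-Python sketch of that indexing, assuming single-head attention with identity projections and a ReLU-only placeholder FFN. The helper names `self_attn`, `ffn`, and `visual_prompt_embedding` are hypothetical and are not the T-Rex2 code.

```python
import math


def self_attn(tokens):
    """Single-head dot-product self-attention with identity Q/K/V projections.
    Simplification: the real encoder uses learned projection weights."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors (here the tokens themselves).
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def ffn(tokens):
    """Hypothetical position-wise FFN placeholder: ReLU only, no learned weights."""
    return [[max(0.0, x) for x in t] for t in tokens]


def visual_prompt_embedding(q_prime):
    """V = FFN(SelfAttn(q'))[-1], returned as a single 1 x hidden_dim row."""
    return [ffn(self_attn(q_prime))[-1]]


# Two prompt tokens plus one global token in the last position (hidden_dim = 4).
q_prime = [
    [0.5, -0.2, 0.1, 0.0],
    [0.3, 0.4, -0.1, 0.2],
    [0.0, 0.1, 0.6, -0.3],
]
V = visual_prompt_embedding(q_prime)
print(len(V), len(V[0]))  # prints "1 4"
```

However many prompt tokens go in, only one row comes out, which matches the one-row-per-prompt shape given in the reply.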

Mountchicken commented 1 month ago

The size of each visual prompt embedding is torch.Size([1, hidden_dim]).
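With one prompt embedding of shape (1, hidden_dim), scoring N encoder queries gives an N x 1 matrix of logits. Stacking K prompts into a (K, hidden_dim) matrix gives N x K logits, the same shape as `queries @ prompts.T` in PyTorch. The sketch below assumes a plain dot product as the similarity. The exact similarity function and any normalization used in T-Rex2 may differ, and `prompt_similarity` is a hypothetical helper.

```python
def prompt_similarity(queries, prompts):
    """Dot-product similarity: queries is N x D, prompts is K x D, result is N x K."""
    d = len(prompts[0])
    # Every query must share hidden_dim with the prompt embeddings.
    assert all(len(q) == d for q in queries), "hidden_dim mismatch"
    return [[sum(a * b for a, b in zip(q, p)) for p in prompts] for q in queries]


v = [[1.0, 0.0, 0.0, 0.0]]  # one visual prompt embedding, shape (1, hidden_dim=4)
queries = [
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [-1.0, 0.0, 5.0, 0.0],
]
logits = prompt_similarity(queries, v)
print(logits)  # prints "[[2.0], [0.0], [-1.0]]"
```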