IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

About Visual Prompt Encoder and Contrastive Alignment #85

Open hao416 opened 3 months ago

hao416 commented 3 months ago

Hello, authors. I would like to ask two questions. 1. How do you handle the box query features and point query features after deformable cross-attention: are they concatenated? 2. How do you obtain the corresponding text prompt embeddings, e.g. for "cat" or "dog", from the [CLS] token output?
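For reference, here is a rough sketch of what I have in mind for both questions. The module name, dimensions, and the use of the Hugging Face `CLIPTextModel` are my own assumptions, not taken from the T-Rex2 code:

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

class VisualPromptAggregator(nn.Module):
    """My guess for question 1: concatenate the box and point query
    features after deformable cross-attention, then project back to
    the model dimension. (Hypothetical module, not from T-Rex2.)"""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, box_queries, point_queries):
        # Both inputs: (batch, num_queries, dim), already updated
        # by deformable cross-attention over the image features.
        fused = torch.cat([box_queries, point_queries], dim=-1)
        return self.proj(fused)  # (batch, num_queries, dim)

# My guess for question 2: encode each category name separately and take
# the pooled output, which for CLIP's text encoder is the hidden state at
# the end-of-text token (the [CLS]-style summary token).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["cat", "dog"], padding=True, return_tensors="pt")
text_embeds = text_encoder(**tokens).pooler_output  # (2, hidden_dim)
```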

hao416 commented 3 weeks ago

@Mountchicken Hi, dear author, sorry to bother you, but I have a question. I have trained the model with text prompts only for nearly 4 days, almost 4 full epochs on the O365, GoldG, and Bamboo datasets, but the zero-shot mAP on COCO is only 5.3. Convergence is very slow. Is this normal? I notice you said training took only about 3 days on 8x A100. My model consists of an image encoder, a CLIP-B text encoder, the query selection layer proposed in Grounding DINO, and the remaining DINO components, identical to DINO. The loss is cls + L1 + GIoU + DN (box), where cls is a contrastive loss. Do you use any other approaches or details to train the text prompt branch? I'm looking forward to your reply. Thanks!
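For clarity, this is roughly the contrastive classification term I am computing. The function name, the temperature value, and the plain BCE (instead of a focal variant) are my own simplifications, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def contrastive_cls_loss(query_feats, text_embeds, target_labels, tau=0.07):
    """Sketch of a query-to-text contrastive classification loss.

    query_feats:   (num_queries, dim)  decoder output embeddings
    text_embeds:   (num_classes, dim)  [CLS]-style text prompt embeddings
    target_labels: (num_queries,)      class index per matched query,
                                       num_classes for "no object"
    """
    q = F.normalize(query_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = q @ t.T / tau  # (num_queries, num_classes)
    # One-hot targets; unmatched ("no object") queries get an all-zero row.
    targets = F.one_hot(target_labels, num_classes=t.shape[0] + 1)[:, :-1].float()
    # Sigmoid BCE over class similarities, in the spirit of DINO-style heads.
    return F.binary_cross_entropy_with_logits(logits, targets)
```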

hao416 commented 3 weeks ago

It seems that convergence takes a very long time without interaction or fusion between text and image features, like the fusion modules in Grounding DINO or other models.
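By fusion I mean something like the following simplified stand-in for Grounding DINO's feature enhancer layer (my own sketch, not the actual implementation):

```python
import torch.nn as nn

class BiDirectionalFusion(nn.Module):
    """Bi-directional cross-attention between image and text features,
    applied before query selection, so the two modalities interact early."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.img2text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text2img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, text_feats):
        # img_feats:  (batch, num_pixels, dim)
        # text_feats: (batch, num_tokens, dim)
        img_out, _ = self.text2img(img_feats, text_feats, text_feats)
        text_out, _ = self.img2text(text_feats, img_feats, img_feats)
        return img_feats + img_out, text_feats + text_out
```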