IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

About Visual Prompt Encoder and Contrastive Alignment #85

Open hao416 opened 3 months ago

hao416 commented 3 months ago

Hello, authors. I would like to ask two questions. 1. How do you handle the box query feature and the point query feature after deformable cross-attention: are they concatenated? 2. How do you obtain the corresponding text prompt embeddings, e.g. for "cat" and "dog", from the [CLS] token output?

Mountchicken commented 3 months ago

Hi @hao416, sorry for the late reply. 1. During training we train the box prompt and the point prompt in different iterations, i.e. they are never used at the same time. 2. CLIP adds a [CLS] token to the input sentence by default, and we can extract the feature of the [CLS] token at the output of CLIP.

hao416 commented 3 months ago

Thanks, dear author. But I have another question. Grounding DINO concatenates many labels into one input sentence; does T-Rex2 do the same? I saw in another GitHub issue that you said T-Rex2 uses phrases. If you use a whole sentence, I don't know how to locate the embedding of each label from the [CLS] token, because its size is 1x516. But if you use separate phrases, negative labels cannot play a role. I'm sorry for my poor English; I look forward to your reply. Thanks again!

Mountchicken commented 3 months ago

Let's say we have four labels: "a yellow dog", "cat", "person", "a giant apple". We pass these four phrases or category names to CLIP separately, four times in total, and get their corresponding text embeddings. Here is a brief example:

a yellow dog [CLS] -> CLIP -> [CLS]
cat [CLS] -> CLIP -> [CLS]
person [CLS] -> CLIP -> [CLS]
a giant apple [CLS] -> CLIP -> [CLS]

We concatenate these four text embeddings to get a tensor of shape 4xC and use it for the loss computation.
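
For reference, here is a minimal sketch of this step, assuming the Hugging Face CLIP text encoder mentioned later in this thread (the checkpoint name is only an example): each phrase is encoded independently and the pooled [CLS]/[EOS] features are stacked into a K x C tensor.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["a yellow dog", "cat", "person", "a giant apple"]
embeddings = []
for phrase in phrases:
    inputs = tokenizer(phrase, return_tensors="pt")
    outputs = text_model(**inputs)
    embeddings.append(outputs.pooler_output)  # pooled [CLS]/[EOS] feature, shape 1 x C

text_embeddings = torch.cat(embeddings, dim=0)  # shape K x C, here 4 x 512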

hao416 commented 3 months ago

Oh, thanks, now I understand it correctly! Sorry, I also have a question. In the paper, you say that the model randomly selects from 1 to n GT boxes as visual prompts. Now, if I set the batch size to 2, with img1 and img2: from img1 the model gets 3 visual prompts, and from img2 it gets 5 visual prompts. Should I pad img1's 3 prompts to 5 for a batched operation, or use a Python for loop and run it twice? Thanks.

Mountchicken commented 3 months ago

Indeed, we need to pad img1 to 5 prompts.
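
A minimal sketch of this padding, assuming box prompts stored as (num_prompts, 4) tensors; the helper and the validity mask below are illustrative, not the repository's implementation:

import torch

def pad_visual_prompts(prompt_boxes_list, pad_value=0.0):
    # prompt_boxes_list: one (num_prompts_i, 4) tensor per image in the batch
    batch = len(prompt_boxes_list)
    max_prompts = max(boxes.shape[0] for boxes in prompt_boxes_list)
    padded = prompt_boxes_list[0].new_full((batch, max_prompts, 4), pad_value)
    mask = torch.zeros(batch, max_prompts, dtype=torch.bool)
    for i, boxes in enumerate(prompt_boxes_list):
        padded[i, :boxes.shape[0]] = boxes
        mask[i, :boxes.shape[0]] = True  # True marks real prompts, False marks padding
    return padded, mask

# e.g. img1 has 3 prompts and img2 has 5: img1 is padded up to 5
padded, mask = pad_visual_prompts([torch.rand(3, 4), torch.rand(5, 4)])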

hao416 commented 3 months ago

OK, thanks for your replies. You all did a great job. Best wishes!

hao416 commented 3 months ago

Thanks, dear author. I'm sorry, but I may have two last questions. Here is an example: I get 2 "cat" prompts and 3 "dog" prompts. Question 1: in the visual prompt encoder, does K mean the total number of visual prompts (5) or the number of categories (2)? I saw in a GitHub issue that you said K is the number of categories for the contrastive loss. Question 2: the visual prompts will be used as weights for the class predictions, so do I need to average the 2 cat prompts into one and the 3 dog prompts into one so that the model produces 2 class predictions? Thanks.

Mountchicken commented 3 months ago

Q1: K is the number of categories, and in your case K = 2. Q2: If the 2 cats or 3 dogs are from one image, they will be 'averaged' by taking the aggregator token as output. If they are from different images, they will be averaged by calculating the mean value

hao416 commented 3 months ago

OK, so you mean that if the batch size is 1, I only need to use the aggregator token, namely the universal class token C' in your paper, as the class prediction weights. If the batch size is >1, I need to take every C' token and compute their mean as the final prediction weights. Right?

Mountchicken commented 3 months ago

During the training process, we only need to use the aggregator, and this is independent of batch size. This is because, during training, we generate prompts only within the same image, meaning that the embeddings for objects like dogs and cats are used only within the current image. However, during inference, we can obtain an embedding from multiple images. For example, if we have two images, each with three dogs, we would first use the aggregator to extract the prompts for the three dogs in each image to obtain their respective embeddings. Then, we average the embeddings obtained from these two images to get the final embeddings.
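
As a minimal illustration of the inference-time averaging described above (shapes and names are illustrative, not from the repository):

import torch

# aggregator outputs for the same category ("dog") from two different images, each 1 x C
dog_from_img1 = torch.randn(1, 256)
dog_from_img2 = torch.randn(1, 256)

# average the per-image aggregator embeddings to obtain the final visual prompt embedding
dog_embedding = torch.stack([dog_from_img1, dog_from_img2]).mean(dim=0)  # shape 1 x 256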

hao416 commented 3 months ago

OK, thanks, author. I understand your reply. I am now reproducing this work, so I'm sorry that I have many questions about the details. Finally, combining your replies, a few points are still unclear to me. Again an example: 2 cat objects and 3 dog objects in an image.

  1. In the paper you say "we randomly choose between one to all available GT boxes to use as visual prompts". Suppose I pick 2 cat boxes and 2 dog boxes to generate visual prompts. In the visual prompt encoder, K is the number of categories, so does this mean I need to sample one prompt per category (cat and dog) again, or do I feed all 4 prompts as inputs?
  2. Is K a fixed hyperparameter?
  3. The learnable content embedding is broadcast K times to KxD. I can't clearly understand this "broadcast": does it mean the original dimension of the content embedding is 1xD?

Mountchicken commented 2 months ago

  1. Given an image with M categories, we will finally get M visual prompt embeddings, one for each category.
  2. K is not a hyperparameter; it is the number of categories in the current image. If you are using batched training, K will be the largest number of categories in the batch.
  3. The content embedding has shape 1xD and is copied M times to get an MxD tensor.

hao416 commented 2 months ago

OK, I read the paper and your replies again, and I now understand answers 1 and 2. Lastly, I want to confirm the form of the content embedding. In my code I set content_embedding = nn.Embedding(1, 256). 1. I take the final vector from the outputs after (msdeformattn -> self attn -> ffn), namely query[:, -1, :]. 2. I only copy content_embedding.weight M times as the input to (msdeformattn -> self attn -> ffn). Is this correct? Thanks.

Mountchicken commented 2 months ago

Here is an example. Say there are three boxes selected to get the visual prompt embedding for dog. Then you will first broadcast the content embedding for three times, and concat it with the aggregator. This will get you a 4x256 tensor. Together with position embeddings, they will pass through deform -> self attn -> ffn. And lastly, the output at the aggregator position will be used as the final visual prompt embedding.
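
A minimal sketch of this flow, with ordinary multi-head attention standing in for the deformable cross-attention and with all module and variable names being illustrative rather than the repository's:

import torch
import torch.nn as nn

D = 256
num_boxes = 3  # three boxes selected for "dog"

content_embedding = nn.Embedding(1, D)  # learnable 1 x D content query
aggregator_token = nn.Embedding(1, D)   # learnable aggregator query

# broadcast the content embedding to one query per box, then append the aggregator -> 4 x D
queries = torch.cat([content_embedding.weight.expand(num_boxes, -1),
                     aggregator_token.weight], dim=0).unsqueeze(0)   # (1, 4, D)

pos_embed = torch.randn(1, num_boxes + 1, D)  # position embeddings from the boxes (placeholder)
memory = torch.randn(1, 1000, D)              # flattened multi-scale image features (placeholder)

cross_attn = nn.MultiheadAttention(D, 8, batch_first=True)  # stand-in for deformable cross-attn
self_attn = nn.MultiheadAttention(D, 8, batch_first=True)
ffn = nn.Sequential(nn.Linear(D, 4 * D), nn.ReLU(), nn.Linear(4 * D, D))

x, _ = cross_attn(queries + pos_embed, memory, memory)  # deformable cross-attention (stand-in)
x, _ = self_attn(x + pos_embed, x + pos_embed, x)       # self-attention
x = ffn(x)                                              # feed-forward network
visual_prompt_embedding = x[:, -1]  # output at the aggregator position, shape (1, D)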

hao416 commented 2 months ago

Ok, I got it. Thank you very much!!!

hao416 commented 2 months ago

Dear author, I see that Grounding DINO re-maps category_id. For example, if an image has two categories, cat and dog, whose ids are 4 and 5 in the dataset, Grounding DINO re-indexes them from 0 so that cat->0 and dog->1. Do you do the same?

hao416 commented 2 months ago

Dear author, I want to know how you train your model. For Table 6 of the paper, do you train on these datasets one by one, or concatenate them into one larger dataset?

Mountchicken commented 2 months ago

Dear author, I want to know how you train your model. For Table 6 of the paper, do you train on these datasets one by one, or concatenate them into one larger dataset?

We concatenate those datasets into one for training.
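
A minimal sketch of this concatenation with torch.utils.data.ConcatDataset (the tiny TensorDatasets below are just stand-ins for the actual detection datasets in Table 6):

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# stand-ins for O365, GoldG, OpenImages, etc.
datasets = [TensorDataset(torch.arange(n)) for n in (10, 20, 30)]
merged = ConcatDataset(datasets)  # one large dataset spanning all sources
loader = DataLoader(merged, batch_size=2, shuffle=True)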

Mountchicken commented 2 months ago

Dear author, I see that Grounding DINO re-maps category_id. For example, if an image has two categories, cat and dog, whose ids are 4 and 5 in the dataset, Grounding DINO re-indexes them from 0 so that cat->0 and dog->1. Do you do the same?

We don't have a special process for the category id and we simply reuse the original id in its dataset.

hao416 commented 2 months ago

OK, thanks. I notice that you use denoising training in the paper, which depends on the number of classes and the original dataset ids. As you know, DINO's label_enc = nn.Embedding(dn_labelbook_size + 1, hidden_dim). Suppose I have 2 datasets, A (10 categories) and B (20 categories): do you set label_enc = nn.Embedding(30 + 1, hidden_dim)? And if id 1 is "person" in A but "table" in B, how do you handle that? Merge the 2 datasets and re-index the categories from 0 to 29? Thanks.

Mountchicken commented 2 months ago

Since in the open-set task we cannot pre-assign IDs to all the object categories in our datasets, we do not compute the classification DN loss, only the box noise loss.
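
A rough sketch of DN-style box noising under that setting (the function and noise scale are illustrative, not the repository's exact implementation):

import torch

def noise_boxes(gt_boxes, box_noise_scale=0.4):
    # gt_boxes: (N, 4) in normalized cxcywh; jitter centers by up to w/2, h/2 and sizes by up to w, h
    diff = torch.cat([gt_boxes[:, 2:] / 2, gt_boxes[:, 2:]], dim=1) * box_noise_scale
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * diff
    return (gt_boxes + noise).clamp(min=0.0, max=1.0)

noisy = noise_boxes(torch.tensor([[0.5, 0.5, 0.2, 0.3]]))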

hao416 commented 2 months ago

OK, thanks.

hao416 commented 2 months ago

Sorry, I have another question. Is the [CLS] token feature in your text model the same as the [EOS] token feature in the original CLIP paper?

Mountchicken commented 2 months ago

Yes. If you are using CLIP from Hugging Face, you can get the [CLS] token feature like this:

from transformers import CLIPTextModel

model = CLIPTextModel.from_pretrained(pretrained_name)
outputs = model(**inputs)  # inputs produced by the matching CLIPTokenizer
pooled_feature = outputs.pooler_output  # feature at the [EOS] (i.e. [CLS]) position

hao416 commented 2 months ago

OK, thank you very much. You've helped me a lot.

hao416 commented 2 months ago

Dear author, in the visual prompt encoder I define parameters for both box and point prompts. You said the box prompt and the point prompt are trained in different iterations, but I run into problems with PyTorch when using multiple GPUs: it reports that some parameters did not receive gradients. So my question is, do I need to freeze some weights in different iterations? Thanks.

Mountchicken commented 2 months ago

Hi @hao416 There are two solutions. The first one is to set find_unused_parameters=True. Here is an example

model = torch.nn.parallel.DistributedDataParallel(
            model,
            device_ids=[args.gpu],
            find_unused_parameters=True)

The second one is to add the parameters of the unused module to the computation. Here is an example:

import torch
import torch.nn as nn

box_embedding_layer = nn.Linear(4, 256)
point_embedding_layer = nn.Linear(2, 256)
box = torch.rand(8, 4)    # placeholder box prompts
point = torch.rand(8, 2)  # placeholder point prompts

# for a box iteration: touch the point branch's parameters with a zero-weighted term
embedding = box_embedding_layer(box)
for param in point_embedding_layer.parameters():
    embedding = embedding + param.sum() * 0.0

# for a point iteration: touch the box branch's parameters with a zero-weighted term
embedding = point_embedding_layer(point)
for param in box_embedding_layer.parameters():
    embedding = embedding + param.sum() * 0.0

hao416 commented 2 months ago

Thanks, I got it. I also searched for answers on the internet and found that freezing weights in different iterations can work as well. Additionally, I had tried the first solution before, but it didn't work well. Thank you again.

CatfishW commented 2 months ago

def visual_prompt_cross_attention(self, support_feat, memory, query_mask_flatten):
    Q = self.content_embedding.weight[None, :]
    # expand to the same size as support_feat
    Q = Q.expand(support_feat.shape[0], support_feat.shape[1], support_feat.shape[2])
    Q_ = self.cross_attention_vp(
        self.with_pos_embed(Q.transpose(0, 1), support_feat.transpose(0, 1)),
        memory.transpose(0, 1),
        memory.transpose(0, 1),
        query_mask_flatten)[0].transpose(0, 1)
    Q = Q + self.cross_attention_vp_dropout(Q_)
    Q = self.cross_attention_vp_norm(Q)
    q = k = self.with_pos_embed(Q, support_feat)
    Q_, _ = self.self_attn(q, k, value=Q, attn_mask=None)
    Q = Q + self.dropout_post(Q_)
    support_feat = self.norm_post(Q)
    return support_feat

Hi author, I reproduced part of your structure. Could you please help check whether this cross-attention function for extracting prompt features is written correctly?

Mountchicken commented 2 months ago

@CatfishW Sorry for the late reply. The implementation looks fine to me. As for detailed implementation, you can refer to this code: https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/models/GroundingDINO/transformer.py#L802

hao416 commented 1 month ago

Dear author, I have a question: do you freeze the weights of the text prompt encoder during training? Thanks.

Mountchicken commented 1 month ago

Hi @hao416 We don't freeze the CLIP text encoder during training.

hao416 commented 1 month ago

OK, thanks for your reply, but I have two questions:

  1. Recent works show that not freezing the CLIP text encoder may perturb its weights and hurt the final performance. Did you study this effect?
  2. I noticed you mentioned 8 epochs for the visual prompt and 1 epoch for the text prompt. I am trying to reproduce your model, but limited by the number of GPUs I changed this to 4 epochs for the visual prompt and 1 epoch for the text prompt. I find that the text prompt results do not improve on top of the visual prompt results. For example, suppose mAP is 18.0 after 4 epochs with visual prompts; after the first epoch with text prompts (the 5th epoch overall), mAP may drop to 11.0. It seems as if the whole model is being trained from scratch. Is this normal? How many epochs do you use? Thanks.

Mountchicken commented 1 month ago

  1. We tried both freezing and fine-tuning CLIP and found no particular difference between the two; fine-tuning performs a little better.

  2. We train with 8 iterations of visual prompts and then one iteration of text prompts; these are iterations, not epochs.

hao416 commented 1 month ago

OK, I misunderstood it before; I've got it now. So at test time I only need to choose a specific prompt type, either visual or text prompts, to get the final detection results, right?

Mountchicken commented 1 month ago

Yes. During inference you can use either the text prompt or the visual prompt.

hao416 commented 1 month ago

OK, thank you very much.

hao416 commented 1 month ago

Dear author, I want to ask a question about training; here is an example to demonstrate it. In the paper you mention O365/GoldG datasets for text prompt training and O365/OpenImages for visual prompt training. The question is that when training text prompts on O365, the model misses the images from the first 8 iterations for text prompt training, because they are used for visual prompt training. I want to know how to handle the different iteration types in one training process.

In the DINO framework, how do you deal with this in the for loop, namely: for samples, targets in metric_logger.log_every(data_loader, print_freq, header, logger=logger): ...

Thank you.

Mountchicken commented 1 month ago

In the actual code, we define two data loaders: one for the text prompt, assumed to be text_loader, and another for the visual prompt, assumed to be visual_loader. After every 8 iterations of the visual_loader, we iterate once over the text_loader. The implementation can be done in the following way:

text_iter = iter(text_loader)
for step, visual_batch in enumerate(visual_loader):
    # visual prompt iteration
    loss = model(visual_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # after every 8 visual prompt iterations, run one text prompt iteration
    if (step + 1) % 8 == 0:
        try:
            text_batch = next(text_iter)
        except StopIteration:  # restart the text loader when it is exhausted
            text_iter = iter(text_loader)
            text_batch = next(text_iter)
        loss = model(text_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

hao416 commented 1 month ago

OK, thank you. But for a dataset like O365 that is used for both visual and text prompt training, does it need to appear in both text_loader and visual_loader?

hao416 commented 1 month ago

Also, the number of text prompt images is larger than the number of visual prompt images. In your code template, how do you make sure all images are used for text prompt training when the visual_batch for loop ends?

CatfishW commented 1 month ago

Hi author, I'd like to ask: for the deformable cross-attention part, what should be the reference point for the [CLS] token?

Mountchicken commented 3 weeks ago

The number of text prompt images is larger than the number of visual prompt images. In your code template, how do you make sure all images are used for text prompt training when the visual_batch for loop ends?

You don't need to train on all the text prompt data.

Mountchicken commented 3 weeks ago

Hi author, I'd like to ask: for the deformable cross-attention part, what should be the reference point for the [CLS] token?

For the aggregator token in the visual prompt encoder, we use a box covering the whole image (i.e. [0.5, 0.5, 1, 1] in normalized xywh format) as the position embedding.
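
A minimal sketch of attaching that reference box (tensor names are hypothetical):

import torch

prompt_boxes = torch.rand(3, 4)  # K selected prompt boxes, normalized cxcywh
aggregator_box = torch.tensor([[0.5, 0.5, 1.0, 1.0]])  # the aggregator's box spans the whole image
reference_boxes = torch.cat([prompt_boxes, aggregator_box], dim=0)  # (K + 1, 4), one box per query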

hao416 commented 3 weeks ago

You don't need to train on all the text prompt data.

OK, thank you! I tried to train my model with your strategy: I use 16 H800 GPUs to train the text prompt first (3M images), but the mAP is only 0.6 after 2 epochs. I also used the COCO dataset to test this strategy, with similar results, but after 4 epochs it improves, finally reaching about 9.5 mAP (lr_drop is 6). In your paper, with the text prompt alone and no visual prompt, the model reaches 46.4 on COCO. So I don't know whether this trend is normal. I really want to reproduce this model in my project, but I may be running into some problems.

Mountchicken commented 3 weeks ago

We first train the model on text prompt data only to give it basic text prompt detection capability. Then we jointly train the visual prompt along with the text prompt. Maybe you need to first train on the text prompt only.

hao416 commented 3 weeks ago

OK, I know; I got that from your previous answers in other issues. The model is now being trained with the text prompt only. My question is how long, or to what level (e.g. mAP), it should be trained before it is ready for joint training. Another question: we can maintain a global dictionary to sample negative text prompts, but how do you sample negative examples for the visual prompt? I currently just concatenate visual aggregator embeddings from the other images in the current mini-batch. Thanks!

Mountchicken commented 3 weeks ago

In our experiments, we start the joint training when the text prompt reaches 45 mAP on COCO. For visual prompts, we can only sample negative prompts from the current mini-batch.
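
A minimal sketch of sampling visual prompt negatives from the current mini-batch, as discussed above (the helper is hypothetical):

import torch

def build_prompt_bank(per_image_embeds, i):
    # per_image_embeds: one (K_j, C) tensor of aggregator embeddings per image in the batch
    positives = per_image_embeds[i]                                    # this image's categories
    negatives = [e for j, e in enumerate(per_image_embeds) if j != i]  # other images in the batch
    return torch.cat([positives] + negatives, dim=0)                   # (K_i + sum of others, C)

per_image_embeds = [torch.randn(2, 256), torch.randn(3, 256)]  # e.g. 2 and 3 categories
bank_for_img0 = build_prompt_bank(per_image_embeds, 0)         # shape (5, 256)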

hao416 commented 3 weeks ago

In our experiment, we start the joint training when the text prompt can reach 45mAP on COCO; For visual prompts, we can only sample negative prompts from the current mini batch. OK, thank you very much! ! !