Hi @hao416 Sorry for the late reply. 1. During training, we train the box prompt and the point prompt in different iterations, i.e. they are not used at the same time. 2. CLIP adds a [CLS] token to the input sentence by default, and we extract the feature of the [CLS] token at the output of CLIP.
Thanks, dear author. But I have another question. Grounding DINO concatenates many labels into one input sentence, so does T-Rex2 use the same approach? I saw you say in the GitHub issues that T-Rex2 uses phrases. If you use a whole sentence, I don't know how to locate the corresponding label embeddings from the [CLS] token, because its size is 1x516. But if you use distinct phrases, negative labels can't play a role. I'm sorry for my poor English. I look forward to your reply. Thanks again!
Let's say we have four labels: a yellow dog, cat, person, a giant apple. We will pass each of these four phrases or category names through CLIP, one at a time, and get their corresponding text embeddings. Here is a brief example:
a yellow dog [CLS] -> CLIP -> [CLS]
cat [CLS] -> CLIP -> [CLS]
person [CLS] -> CLIP -> [CLS]
a giant apple [CLS] -> CLIP -> [CLS]
We concatenate these four text embeddings to get a tensor of shape 4 x C and use it for the loss computation.
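Here is a minimal sketch of that step, assuming the Hugging Face transformers CLIP text model (the checkpoint name is only an example): each phrase is encoded on its own, its pooled [CLS]/[EOS] feature is kept, and the four features are stacked into a 4 x C tensor.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a yellow dog", "cat", "person", "a giant apple"]
features = []
for label in labels:
    inputs = tokenizer(label, return_tensors="pt")
    outputs = text_model(**inputs)
    features.append(outputs.pooler_output)  # (1, C): pooled feature of the [CLS]/[EOS] token

text_embeddings = torch.cat(features, dim=0)  # (4, C), later used in the loss computation
```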
Oh, thanks, I understand it correctly now! Sorry, I also have a question. In the paper, you say the model randomly selects from 1 to n GT boxes as visual prompts. Now, if I set the batch size to 2 (img1 and img2), and the model gets 3 visual prompts from img1 and 5 visual prompts from img2, should I pad img1's 3 prompts to 5 for a batched operation, or use a Python for loop to process the two images separately? Thanks
Indeed, we need to pad image1 to 5 prompts.
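A minimal padding sketch (not the repository's code), which pads every image to the largest prompt count in the batch and returns a padding mask so the padded slots can be ignored in attention and in the loss:

```python
import torch

def pad_prompts(prompts_per_image):
    """prompts_per_image: list of (K_i, 4) normalized box prompts, e.g. K_1 = 3, K_2 = 5."""
    max_k = max(p.shape[0] for p in prompts_per_image)
    padded, pad_mask = [], []
    for p in prompts_per_image:
        pad_len = max_k - p.shape[0]
        padded.append(torch.cat([p, p.new_zeros(pad_len, p.shape[1])], dim=0))
        pad_mask.append(torch.cat([torch.zeros(p.shape[0], dtype=torch.bool),
                                   torch.ones(pad_len, dtype=torch.bool)]))  # True marks padding
    return torch.stack(padded), torch.stack(pad_mask)  # (B, max_k, 4), (B, max_k)

boxes, mask = pad_prompts([torch.rand(3, 4), torch.rand(5, 4)])  # boxes: (2, 5, 4)
```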
OK, thanks for your replies. You all did a great job. Best wishes!
Thanks, dear author. I'm sorry, but I have two last questions. Here is an example: I get 2 "cat" prompts and 3 "dog" prompts. Question 1: in the visual prompt encoder, does K mean the total number of visual prompts (5) or the number of categories (2)? I saw you say in a GitHub issue that K is the number of categories in the contrastive loss. Question 2: the visual prompts will be used as weights in the class predictions, so do I need to take the mean of the 2 cat prompts and the mean of the 3 dog prompts so that the model produces exactly 2 class predictions? Thanks.
Q1: K is the number of categories, and in your case K = 2. Q2: If the 2 cats or 3 dogs are from one image, they will be 'averaged' by taking the aggregator token as output. If they are from different images, they will be averaged by calculating the mean value.
OK, so you mean that if the batch size is 1, I only need to use the aggregator token, namely the universal class token C' in your paper, as the class prediction weights. If the batch size is > 1, I need to take every C' token and compute their mean as the final prediction weights. Right?
During the training process, we only need to use the aggregator, and this is independent of batch size. This is because, during training, we generate prompts only within the same image, meaning that the embeddings for objects like dogs and cats are used only within the current image. However, during inference, we can obtain an embedding from multiple images. For example, if we have two images, each with three dogs, we would first use the aggregator to extract the prompts for the three dogs in each image to obtain their respective embeddings. Then, we average the embeddings obtained from these two images to get the final embeddings.
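A small sketch of the inference-time averaging described above (the per-image embedding names are placeholders): each image's prompts go through the aggregator first, and only then are the per-image embeddings averaged.

```python
import torch

# Hypothetical (C,) aggregator embeddings of "dog", one per image,
# each produced by running the visual prompt encoder on that image alone.
dog_embed_img1 = torch.randn(256)
dog_embed_img2 = torch.randn(256)

final_dog_embedding = torch.stack([dog_embed_img1, dog_embed_img2], dim=0).mean(dim=0)  # (C,)
```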
OK, thanks, author. I understand your reply. I'm reproducing this work, so I'm sorry that I have many questions about the details. Finally, combining your replies, I still have some unclear points. Again an example: 2 cat objects and 3 dog objects in an image. 1. In the paper, you say "we randomly choose between one to all available GT boxes to use as visual prompts". Now suppose I get 2 cat boxes and 2 dog boxes to generate visual prompts. In the visual prompt encoder, K means the number of categories, so does that mean I need to re-sample one prompt per category (cat and dog), or use all 4 prompts as inputs?
OK, I read the paper and your replies again, and I have understood answers 1 and 2. Lastly, I want to confirm the form of the content embedding. In my code, I set content_embedding = nn.Embedding(1, 256). 1. I take the final vector from the outputs after (msdeformattn -> self attn -> ffn), namely query[:, -1, :]. 2. I only copy content_embedding.weight M times after (msdeformattn -> self attn -> ffn). Which is correct? Thanks
Here is an example. Say there are three boxes selected to get the visual prompt embedding for dog. Then you will first broadcast the content embedding for three times, and concat it with the aggregator. This will get you a 4x256 tensor. Together with position embeddings, they will pass through deform -> self attn -> ffn. And lastly, the output at the aggregator position will be used as the final visual prompt embedding.
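A compact sketch of that construction, assuming a 256-d model and Embedding layers for the content and aggregator tokens (the names are illustrative, not the repository's):

```python
import torch
import torch.nn as nn

C = 256
content_embedding = nn.Embedding(1, C)     # shared content query, broadcast per prompt box
aggregator_embedding = nn.Embedding(1, C)  # universal aggregator ([CLS]-like) token

num_boxes = 3  # e.g. three dog boxes selected as visual prompts
content = content_embedding.weight.expand(num_boxes, C)              # (3, C)
queries = torch.cat([content, aggregator_embedding.weight], dim=0)   # (4, C)

# Together with the box position embeddings, `queries` then pass through
# deformable cross-attention -> self-attention -> FFN, and the output at the
# last (aggregator) position is used as the final visual prompt embedding.
```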
Ok, I got it. Thank you very much!!!
Dear author, I see that Grounding DINO re-processes the category_id. For example, an image has two categories, cat and dog, where cat's id is 4 and dog's id is 5 in the dataset; Grounding DINO re-indexes them from 0 so that cat -> 0 and dog -> 1. Do you use the same approach?
Dear author, I want to know how you train your model. For the datasets in Table 6 of the paper, do you train on them one by one, or concatenate them into one larger dataset?
We concatenate those datasets into one for training.
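For reference, a minimal way to do this with PyTorch (the dataset class and list-style collate here are stand-ins, not the authors' code):

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class DummyDetectionDataset(Dataset):
    """Stand-in for one source (e.g. O365, GoldG, OpenImages)."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return {"image_id": idx}

merged_dataset = ConcatDataset([DummyDetectionDataset(100), DummyDetectionDataset(200)])
train_loader = DataLoader(merged_dataset, batch_size=2, shuffle=True,
                          collate_fn=lambda batch: batch)  # detection-style list collate
print(len(merged_dataset))  # 300: one loader draws samples from all sources
```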
We don't have a special process for the category id; we simply reuse the original id from its dataset.
OK, thanks. I notice that you use denoising training in the paper, which depends on the number of classes and the original ids in the dataset. You know, DINO's label_enc = nn.Embedding(dn_labelbook_size + 1, hidden_dim). Suppose I have 2 datasets, A (10 categories) and B (20 categories): do you set label_enc = nn.Embedding(30 + 1, hidden_dim)? And if id 1 is person in A but id 1 is table in B, how do you deal with that? Fuse the 2 datasets and re-index the categories from 0 to 29? Thanks
Since in the open-set task we cannot pre-assign IDs to all the object categories in our datasets, we do not compute the classification DN loss but only the box noising loss.
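A generic DN-DETR-style box-noising sketch of what that leaves (an assumption about the shape of the computation, not the authors' exact code; box_noise_scale is an assumed hyperparameter): GT boxes are jittered to build denoising queries, and no class labels are flipped since the classification DN loss is dropped.

```python
import torch

def noise_boxes(gt_boxes: torch.Tensor, box_noise_scale: float = 0.4) -> torch.Tensor:
    """Jitter GT boxes (normalized cxcywh) to build box-denoising queries."""
    cxcy, wh = gt_boxes[:, :2], gt_boxes[:, 2:]
    cxcy = cxcy + (torch.rand_like(cxcy) * 2 - 1) * wh / 2 * box_noise_scale  # shift centers
    wh = wh * (1 + (torch.rand_like(wh) * 2 - 1) * box_noise_scale)           # rescale sizes
    return torch.cat([cxcy, wh], dim=1).clamp(0, 1)
```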
OK, thanks.
Sorry, I have another question: is the feature of the [CLS] token in your text model the same as the feature of the [EOS] token in the original CLIP paper?
Yes. If you are using CLIP from Hugging Face, then you can get the [CLS] token like this:
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained(pretrained_name)
model = CLIPTextModel.from_pretrained(pretrained_name)
outputs = model(**tokenizer("a yellow dog", return_tensors="pt"))
pooled_feature = outputs.pooler_output  # feature of the [EOS]/[CLS] token, shape (1, hidden_dim)
OK, thank you very much. You helped me a lot.
Dear author, in the visual prompt encoder I define parameters for both box and point prompts. I notice you said you train the box prompt and the point prompt in different iterations, but I have problems with torch when I use multiple GPUs: it reports that some parameters did not receive gradients. So my question is, do I need to freeze some weights in different iterations? Thanks
Hi @hao416 There are two solutions. The first one is to set find_unused_parameters=True. Here is an example:
model = torch.nn.parallel.DistributedDataParallel(
model,
device_ids=[args.gpu],
find_unused_parameters=True)
The second one is to add the parameters of the unused module to the computation with a zero weight. Here is an example:
import torch.nn as nn

box_embedding_layer = nn.Linear(4, 256)
point_embedding_layer = nn.Linear(2, 256)

# for a box iteration: touch the point branch's parameters with a zero-weighted term
embedding = box_embedding_layer(box)
for param in point_embedding_layer.parameters():
    embedding = embedding + param.sum() * 0.0

# for a point iteration: touch the box branch's parameters with a zero-weighted term
embedding = point_embedding_layer(point)
for param in box_embedding_layer.parameters():
    embedding = embedding + param.sum() * 0.0
Thanks, I got it. I also searched for answers on the internet and found that one can freeze weights in different iterations. Additionally, I tried the first solution before, but it did not work well. Thank you again.
def visual_prompt_cross_attention(self, support_feat, memory, query_mask_flatten):
    Q = self.content_embedding.weight[None, :]
    # expand to the same size as support_feat
    Q = Q.expand(support_feat.shape[0], support_feat.shape[1], support_feat.shape[2])
    Q_ = self.cross_attention_vp(
        self.with_pos_embed(Q.transpose(0, 1), support_feat.transpose(0, 1)),
        memory.transpose(0, 1),
        memory.transpose(0, 1),
        query_mask_flatten)[0].transpose(0, 1)
    Q = Q + self.cross_attention_vp_dropout(Q_)
    Q = self.cross_attention_vp_norm(Q)
    q = k = self.with_pos_embed(Q, support_feat)
    Q_, _ = self.self_attn(q, k, value=Q, attn_mask=None)
    Q = Q + self.dropout_post(Q_)
    support_feat = self.norm_post(Q)
    return support_feat
Dear author, I reproduced part of the structure following your design. Could you please help me check whether this function, which uses cross attention to extract the prompt features, is written correctly?
@CatfishW Sorry for the late reply. The implementation looks fine to me. As for detailed implementation, you can refer to this code: https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/models/GroundingDINO/transformer.py#L802
Dear author, I have a question: do you freeze the weights of the text prompt encoder during training? Thanks
Hi @hao416 We don't freeze the CLIP text encoder during training.
OK, thanks for your reply, but I have two questions:
We tried freezing CLIP and fine-tuning CLIP, and found that there was no particular difference between the two; fine-tuning performs a little better.
We train with 8 iterations of visual prompts and then one iteration of text prompts, rather than alternating by epoch.
OK, I misunderstood it before and I get it now. So at test time I only need to choose a specific prompt type, such as visual or text prompts, to get the final detection results, right?
Yes. During inference you can use either the text prompt or the visual prompt.
OK, thank you very much.
Dear author, I want to ask a question about training, with an example: in the paper you mention O365/GoldG datasets for text prompt training and O365/OpenImages for visual prompt training. The question is that when training with text prompts on O365, the model has missed the images from the first 8 iterations for text prompt training, because they were used for visual prompt training. I want to know how you handle the different iterations within one training process.
In the DINO framework, how do you deal with this in the loop, namely: for samples, targets in metric_logger.log_every(data_loader, print_freq, header, logger=logger): ...
Thank you.
In the actual code, we define two data loaders: one for the text prompt, assumed to be text_loader, and another for the visual prompt, assumed to be visual_loader. After every 8 iterations of the visual_loader, we iterate once over the text_loader. The implementation can be done in the following way:
step = 0
text_iter = iter(text_loader)  # a DataLoader is not itself an iterator, so wrap it
for visual_batch in visual_loader:
    loss = model(visual_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 8 == 0:
        try:
            text_batch = next(text_iter)
        except StopIteration:          # restart the text loader once it is exhausted
            text_iter = iter(text_loader)
            text_batch = next(text_iter)
        loss = model(text_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    step += 1
OK, thank you. But for a dataset like O365, which is used for both visual and text prompt training, does it need to appear in both text_loader and visual_loader?
Also, the number of images for text prompt training is larger than that for visual prompt training; how do you make sure that all images are used for text prompt training when the for loop over visual_batch ends in your code template?
Hi, dear author, I'd like to ask: for the deformable cross attention part, what should the reference point for the [CLS] token be?
You don't need to train on all of the text prompt data.
For the aggregator token in the visual prompt encoder, we use a box of image size (i.e. [0.5, 0.5, 1, 1], normalized xywh format) as the position embedding
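A tiny sketch of how that reference could be appended to the per-box references (the helper name and shapes are assumptions, not the repository's code):

```python
import torch

def append_aggregator_reference(prompt_boxes: torch.Tensor) -> torch.Tensor:
    """prompt_boxes: (K, 4) normalized cxcywh boxes of the selected visual prompts.

    Returns (K + 1, 4) reference boxes; the last row, the whole-image box
    [0.5, 0.5, 1, 1], drives the position embedding of the aggregator token.
    """
    full_image_box = prompt_boxes.new_tensor([[0.5, 0.5, 1.0, 1.0]])
    return torch.cat([prompt_boxes, full_image_box], dim=0)

refs = append_aggregator_reference(torch.rand(3, 4))  # (4, 4)
```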
OK, thank you! I tried to train my model based on your strategy: I used 16 H800 GPUs to train with the text prompt first (3M images), but its mAP is only 0.6 after 2 epochs. I also used the COCO dataset to test this strategy; the results were the same, but after 4 epochs it improved, finally reaching about 9.5 mAP (lr_drop is 6). In your paper, with only the text prompt and no visual prompt, the model reaches 46.4 on COCO. So I don't know whether this trend is normal. I really want to reproduce this model in my project, but I may have run into some problems.
We first train the model on text prompt data only, to empower the model with basic text prompt detection capability. Then we jointly train the visual prompt along with the text prompt. Maybe you need to first train on the text prompt only.
OK, I see; I had gathered that from your previous answers to other issues. The model is now being trained with the text prompt only. My question is how long, or to what level (such as mAP), it should train before it is ready for joint training. And another question: we can maintain a global dictionary to sample negative text prompts, but how do we sample negative examples for the visual prompt? I just concatenate visual aggregator embeddings from other images in the current mini-batch. Thanks!
In our experiments, we start the joint training when the text prompt reaches 45 mAP on COCO. For visual prompts, we can only sample negative prompts from the current mini-batch.
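A sketch of that mini-batch negative sampling (an illustration of the idea discussed above, not the repository's code): for each image, its own category embeddings serve as positives, and aggregator embeddings from the other images in the batch are concatenated as negatives.

```python
import torch

def classification_weights_for_image(embeds_per_image, image_idx):
    """embeds_per_image: list of (K_i, C) aggregator embeddings, one tensor per image."""
    positives = embeds_per_image[image_idx]
    negatives = torch.cat(
        [e for i, e in enumerate(embeds_per_image) if i != image_idx], dim=0)
    return torch.cat([positives, negatives], dim=0)  # (K_pos + K_neg, C)

weights = classification_weights_for_image([torch.randn(2, 256), torch.randn(3, 256)], 0)  # (5, 256)
```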
OK, thank you very much!
Hello, authors. I would like to ask two questions. 1. How do you deal with the box query feature and the point query feature after the deformable cross-attention: do you concatenate them? 2. How do you get the corresponding text prompt embeddings, such as for "cat" and "dog", from the [CLS] token output?