Open: nhw649 opened this issue 11 months ago
You can find answers in Sec. 4.2 and Table 8.
@Yuxuan-W The implementation is the same as in the paper. If you read the code carefully, you will find that CAM is not used at every encoder layer.
When computing the category codes, src is the support feature, so why do src, category_code, and tsp need to go through QSAttn? This is not mentioned in the paper and cannot be seen in Fig. 4.
In deformable_transformer.py:
```python
def forward_supp_branch(self, src, pos, reference_points, spatial_shapes,
                        level_start_index, padding_mask, tsp, support_boxes):
    # self attention
    src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points, src,
                          spatial_shapes, level_start_index, padding_mask)
    src = src + self.dropout1(src2)
    src = self.norm1(src)

    # ROI-align the support features inside the support boxes
    support_img_h, support_img_w = spatial_shapes[0, 0], spatial_shapes[0, 1]
    supp_roi = torchvision.ops.roi_align(
        src.transpose(1, 2).reshape(src.shape[0], -1, support_img_h, support_img_w),
        support_boxes,
        output_size=(7, 7),
        spatial_scale=1 / 32.0,
        aligned=True)
    supp_roi = supp_roi.mean(3).mean(2)  # [episode_size, C]
    category_code = supp_roi.sigmoid()   # [episode_size, C]

    if self.QSAttn:
        # siamese attention
        src, tsp = self.siamese_attn(
            src,
            inverse_sigmoid(category_code).unsqueeze(0).expand(src.shape[0], -1, -1),
            category_code.unsqueeze(0).expand(src.shape[0], -1, -1),
            tsp)
        # ffn
        src = self.forward_ffn(src + tsp)
    else:
        src = self.forward_ffn(src)
    return src, category_code
```
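(For context, `inverse_sigmoid` here is the standard Deformable-DETR helper, a numerically clamped logit. Below is a minimal, self-contained sketch of the tensor shapes being broadcast into `siamese_attn`; the sizes `bs`, `episode_size`, and `C` are made up for illustration, not taken from the repo's configs.)

```python
import torch

def inverse_sigmoid(x, eps=1e-5):
    # Standard Deformable-DETR utility: a numerically clamped logit,
    # i.e. the inverse of sigmoid.
    x = x.clamp(min=0, max=1)
    x1 = x.clamp(min=eps)
    x2 = (1 - x).clamp(min=eps)
    return torch.log(x1 / x2)

# Illustrative sizes only.
bs, episode_size, C = 2, 5, 256
category_code = torch.rand(episode_size, C)  # in (0, 1), like supp_roi.sigmoid()

# The snippet above broadcasts the per-episode codes across the batch so that
# every sample in the batch attends to the same episode_size category codes:
k = inverse_sigmoid(category_code).unsqueeze(0).expand(bs, -1, -1)
v = category_code.unsqueeze(0).expand(bs, -1, -1)
print(k.shape, v.shape)  # torch.Size([2, 5, 256]) torch.Size([2, 5, 256])
```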
Sorry for my careless reading; the comment has been removed for clarity.
@Yuxuan-W No problem. Any discussion is welcome.
@nhw649 Please allow me some time to address your concern, as I have a full-time job now and haven't read my own paper in a long time.
Okay, please do. Thank you.
To restate the questions: (1) when computing the category codes, src is the support feature, so why do src, category_code, and tsp need to go through QSAttn? It is not mentioned in the paper. (2) Why is QSAttn applied only at the first layer?
In the DeformableTransformerEncoder, you will find that self.QSAttn == True only for the first layer. However, if you look at the category code computed at the first layer, namely category_code[0], you will find it is not involved in the siamese_attn: the ROI pooling that produces it runs before the QSAttn block. In other words, only category_code[1:] are affected by siamese_attn, and they are never used.
So I guess it is just for implementation convenience. There is no problem, and it is aligned with the paper.
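To make that concrete, here is a runnable toy sketch of that control flow. Everything here (`ToyLayer` and its bodies) is an illustrative stand-in, not the repository's code; it only shows that each layer's category code is computed before its own siamese attention, and that QSAttn fires only at layer 0:

```python
import torch

class ToyLayer:
    """Stand-in for DeformableTransformerEncoderLayer (illustrative only)."""
    def __init__(self, use_qsattn: bool):
        self.QSAttn = use_qsattn

    def forward_supp_branch(self, src):
        # The ROI pooling + sigmoid run BEFORE the QSAttn branch, so the code
        # a layer returns never sees that same layer's siamese attention.
        category_code = src.mean(dim=1).sigmoid()  # stand-in for roi_align + mean
        if self.QSAttn:
            src = src + 1.0  # stand-in for siamese_attn + FFN
        return src, category_code

layers = [ToyLayer(use_qsattn=(lid == 0)) for lid in range(6)]
src = torch.zeros(2, 10, 256)

category_codes = []
for layer in layers:
    src, code = layer.forward_supp_branch(src)
    category_codes.append(code)

# category_codes[0] comes from features untouched by QSAttn, matching the
# paper; category_codes[1:] see layer 0's QSAttn output but go unused.
print(torch.equal(category_codes[0], torch.zeros(2, 256).sigmoid()))  # True
```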
I hope @Yuxuan-W's comments have addressed your concerns, @nhw649. Thank you @Yuxuan-W!