IDEA-Research / DINO

[ICLR 2023] Official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection"

Model.named_parameters() #164

Open anjugopinath opened 1 year ago

anjugopinath commented 1 year ago

I used this code to print all the named parameters:

```python
for name, param in model.named_parameters():
    print(name)
```

And this is the output:

```
transformer.level_embed transformer.encoder.layers.0.self_attn.sampling_offsets.weight transformer.encoder.layers.0.self_attn.sampling_offsets.bias transformer.encoder.layers.0.self_attn.attention_weights.weight transformer.encoder.layers.0.self_attn.attention_weights.bias transformer.encoder.layers.0.self_attn.value_proj.weight transformer.encoder.layers.0.self_attn.value_proj.bias transformer.encoder.layers.0.self_attn.output_proj.weight transformer.encoder.layers.0.self_attn.output_proj.bias transformer.encoder.layers.0.norm1.weight transformer.encoder.layers.0.norm1.bias transformer.encoder.layers.0.linear1.weight transformer.encoder.layers.0.linear1.bias transformer.encoder.layers.0.linear2.weight transformer.encoder.layers.0.linear2.bias transformer.encoder.layers.0.norm2.weight transformer.encoder.layers.0.norm2.bias transformer.encoder.layers.1.self_attn.sampling_offsets.weight transformer.encoder.layers.1.self_attn.sampling_offsets.bias transformer.encoder.layers.1.self_attn.attention_weights.weight transformer.encoder.layers.1.self_attn.attention_weights.bias transformer.encoder.layers.1.self_attn.value_proj.weight transformer.encoder.layers.1.self_attn.value_proj.bias transformer.encoder.layers.1.self_attn.output_proj.weight transformer.encoder.layers.1.self_attn.output_proj.bias transformer.encoder.layers.1.norm1.weight transformer.encoder.layers.1.norm1.bias transformer.encoder.layers.1.linear1.weight transformer.encoder.layers.1.linear1.bias transformer.encoder.layers.1.linear2.weight transformer.encoder.layers.1.linear2.bias transformer.encoder.layers.1.norm2.weight transformer.encoder.layers.1.norm2.bias transformer.encoder.layers.2.self_attn.sampling_offsets.weight transformer.encoder.layers.2.self_attn.sampling_offsets.bias transformer.encoder.layers.2.self_attn.attention_weights.weight transformer.encoder.layers.2.self_attn.attention_weights.bias transformer.encoder.layers.2.self_attn.value_proj.weight transformer.encoder.layers.2.self_attn.value_proj.bias transformer.encoder.layers.2.self_attn.output_proj.weight transformer.encoder.layers.2.self_attn.output_proj.bias transformer.encoder.layers.2.norm1.weight transformer.encoder.layers.2.norm1.bias transformer.encoder.layers.2.linear1.weight transformer.encoder.layers.2.linear1.bias transformer.encoder.layers.2.linear2.weight transformer.encoder.layers.2.linear2.bias transformer.encoder.layers.2.norm2.weight transformer.encoder.layers.2.norm2.bias transformer.encoder.layers.3.self_attn.sampling_offsets.weight transformer.encoder.layers.3.self_attn.sampling_offsets.bias transformer.encoder.layers.3.self_attn.attention_weights.weight transformer.encoder.layers.3.self_attn.attention_weights.bias transformer.encoder.layers.3.self_attn.value_proj.weight transformer.encoder.layers.3.self_attn.value_proj.bias transformer.encoder.layers.3.self_attn.output_proj.weight transformer.encoder.layers.3.self_attn.output_proj.bias transformer.encoder.layers.3.norm1.weight transformer.encoder.layers.3.norm1.bias transformer.encoder.layers.3.linear1.weight transformer.encoder.layers.3.linear1.bias transformer.encoder.layers.3.linear2.weight transformer.encoder.layers.3.linear2.bias transformer.encoder.layers.3.norm2.weight transformer.encoder.layers.3.norm2.bias transformer.encoder.layers.4.self_attn.sampling_offsets.weight transformer.encoder.layers.4.self_attn.sampling_offsets.bias transformer.encoder.layers.4.self_attn.attention_weights.weight transformer.encoder.layers.4.self_attn.attention_weights.bias 
transformer.encoder.layers.4.self_attn.value_proj.weight transformer.encoder.layers.4.self_attn.value_proj.bias transformer.encoder.layers.4.self_attn.output_proj.weight transformer.encoder.layers.4.self_attn.output_proj.bias transformer.encoder.layers.4.norm1.weight transformer.encoder.layers.4.norm1.bias transformer.encoder.layers.4.linear1.weight transformer.encoder.layers.4.linear1.bias transformer.encoder.layers.4.linear2.weight transformer.encoder.layers.4.linear2.bias transformer.encoder.layers.4.norm2.weight transformer.encoder.layers.4.norm2.bias transformer.encoder.layers.5.self_attn.sampling_offsets.weight transformer.encoder.layers.5.self_attn.sampling_offsets.bias transformer.encoder.layers.5.self_attn.attention_weights.weight transformer.encoder.layers.5.self_attn.attention_weights.bias transformer.encoder.layers.5.self_attn.value_proj.weight transformer.encoder.layers.5.self_attn.value_proj.bias transformer.encoder.layers.5.self_attn.output_proj.weight transformer.encoder.layers.5.self_attn.output_proj.bias transformer.encoder.layers.5.norm1.weight transformer.encoder.layers.5.norm1.bias transformer.encoder.layers.5.linear1.weight transformer.encoder.layers.5.linear1.bias transformer.encoder.layers.5.linear2.weight transformer.encoder.layers.5.linear2.bias transformer.encoder.layers.5.norm2.weight transformer.encoder.layers.5.norm2.bias transformer.decoder.layers.0.cross_attn.sampling_offsets.weight transformer.decoder.layers.0.cross_attn.sampling_offsets.bias transformer.decoder.layers.0.cross_attn.attention_weights.weight transformer.decoder.layers.0.cross_attn.attention_weights.bias transformer.decoder.layers.0.cross_attn.value_proj.weight transformer.decoder.layers.0.cross_attn.value_proj.bias transformer.decoder.layers.0.cross_attn.output_proj.weight transformer.decoder.layers.0.cross_attn.output_proj.bias transformer.decoder.layers.0.norm1.weight transformer.decoder.layers.0.norm1.bias transformer.decoder.layers.0.self_attn.in_proj_weight transformer.decoder.layers.0.self_attn.in_proj_bias transformer.decoder.layers.0.self_attn.out_proj.weight transformer.decoder.layers.0.self_attn.out_proj.bias transformer.decoder.layers.0.norm2.weight transformer.decoder.layers.0.norm2.bias transformer.decoder.layers.0.linear1.weight transformer.decoder.layers.0.linear1.bias transformer.decoder.layers.0.linear2.weight transformer.decoder.layers.0.linear2.bias transformer.decoder.layers.0.norm3.weight transformer.decoder.layers.0.norm3.bias transformer.decoder.layers.1.cross_attn.sampling_offsets.weight transformer.decoder.layers.1.cross_attn.sampling_offsets.bias transformer.decoder.layers.1.cross_attn.attention_weights.weight transformer.decoder.layers.1.cross_attn.attention_weights.bias transformer.decoder.layers.1.cross_attn.value_proj.weight transformer.decoder.layers.1.cross_attn.value_proj.bias transformer.decoder.layers.1.cross_attn.output_proj.weight transformer.decoder.layers.1.cross_attn.output_proj.bias transformer.decoder.layers.1.norm1.weight transformer.decoder.layers.1.norm1.bias transformer.decoder.layers.1.self_attn.in_proj_weight transformer.decoder.layers.1.self_attn.in_proj_bias transformer.decoder.layers.1.self_attn.out_proj.weight transformer.decoder.layers.1.self_attn.out_proj.bias transformer.decoder.layers.1.norm2.weight transformer.decoder.layers.1.norm2.bias transformer.decoder.layers.1.linear1.weight transformer.decoder.layers.1.linear1.bias transformer.decoder.layers.1.linear2.weight transformer.decoder.layers.1.linear2.bias 
transformer.decoder.layers.1.norm3.weight transformer.decoder.layers.1.norm3.bias transformer.decoder.layers.2.cross_attn.sampling_offsets.weight transformer.decoder.layers.2.cross_attn.sampling_offsets.bias transformer.decoder.layers.2.cross_attn.attention_weights.weight transformer.decoder.layers.2.cross_attn.attention_weights.bias transformer.decoder.layers.2.cross_attn.value_proj.weight transformer.decoder.layers.2.cross_attn.value_proj.bias transformer.decoder.layers.2.cross_attn.output_proj.weight transformer.decoder.layers.2.cross_attn.output_proj.bias transformer.decoder.layers.2.norm1.weight transformer.decoder.layers.2.norm1.bias transformer.decoder.layers.2.self_attn.in_proj_weight transformer.decoder.layers.2.self_attn.in_proj_bias transformer.decoder.layers.2.self_attn.out_proj.weight transformer.decoder.layers.2.self_attn.out_proj.bias transformer.decoder.layers.2.norm2.weight transformer.decoder.layers.2.norm2.bias transformer.decoder.layers.2.linear1.weight transformer.decoder.layers.2.linear1.bias transformer.decoder.layers.2.linear2.weight transformer.decoder.layers.2.linear2.bias transformer.decoder.layers.2.norm3.weight transformer.decoder.layers.2.norm3.bias transformer.decoder.layers.3.cross_attn.sampling_offsets.weight transformer.decoder.layers.3.cross_attn.sampling_offsets.bias transformer.decoder.layers.3.cross_attn.attention_weights.weight transformer.decoder.layers.3.cross_attn.attention_weights.bias transformer.decoder.layers.3.cross_attn.value_proj.weight transformer.decoder.layers.3.cross_attn.value_proj.bias transformer.decoder.layers.3.cross_attn.output_proj.weight transformer.decoder.layers.3.cross_attn.output_proj.bias transformer.decoder.layers.3.norm1.weight transformer.decoder.layers.3.norm1.bias transformer.decoder.layers.3.self_attn.in_proj_weight transformer.decoder.layers.3.self_attn.in_proj_bias transformer.decoder.layers.3.self_attn.out_proj.weight transformer.decoder.layers.3.self_attn.out_proj.bias transformer.decoder.layers.3.norm2.weight transformer.decoder.layers.3.norm2.bias transformer.decoder.layers.3.linear1.weight transformer.decoder.layers.3.linear1.bias transformer.decoder.layers.3.linear2.weight transformer.decoder.layers.3.linear2.bias transformer.decoder.layers.3.norm3.weight transformer.decoder.layers.3.norm3.bias transformer.decoder.layers.4.cross_attn.sampling_offsets.weight transformer.decoder.layers.4.cross_attn.sampling_offsets.bias transformer.decoder.layers.4.cross_attn.attention_weights.weight transformer.decoder.layers.4.cross_attn.attention_weights.bias transformer.decoder.layers.4.cross_attn.value_proj.weight transformer.decoder.layers.4.cross_attn.value_proj.bias transformer.decoder.layers.4.cross_attn.output_proj.weight transformer.decoder.layers.4.cross_attn.output_proj.bias transformer.decoder.layers.4.norm1.weight transformer.decoder.layers.4.norm1.bias transformer.decoder.layers.4.self_attn.in_proj_weight transformer.decoder.layers.4.self_attn.in_proj_bias transformer.decoder.layers.4.self_attn.out_proj.weight transformer.decoder.layers.4.self_attn.out_proj.bias transformer.decoder.layers.4.norm2.weight transformer.decoder.layers.4.norm2.bias transformer.decoder.layers.4.linear1.weight transformer.decoder.layers.4.linear1.bias transformer.decoder.layers.4.linear2.weight transformer.decoder.layers.4.linear2.bias transformer.decoder.layers.4.norm3.weight transformer.decoder.layers.4.norm3.bias transformer.decoder.layers.5.cross_attn.sampling_offsets.weight 
transformer.decoder.layers.5.cross_attn.sampling_offsets.bias transformer.decoder.layers.5.cross_attn.attention_weights.weight transformer.decoder.layers.5.cross_attn.attention_weights.bias transformer.decoder.layers.5.cross_attn.value_proj.weight transformer.decoder.layers.5.cross_attn.value_proj.bias transformer.decoder.layers.5.cross_attn.output_proj.weight transformer.decoder.layers.5.cross_attn.output_proj.bias transformer.decoder.layers.5.norm1.weight transformer.decoder.layers.5.norm1.bias transformer.decoder.layers.5.self_attn.in_proj_weight transformer.decoder.layers.5.self_attn.in_proj_bias transformer.decoder.layers.5.self_attn.out_proj.weight transformer.decoder.layers.5.self_attn.out_proj.bias transformer.decoder.layers.5.norm2.weight transformer.decoder.layers.5.norm2.bias transformer.decoder.layers.5.linear1.weight transformer.decoder.layers.5.linear1.bias transformer.decoder.layers.5.linear2.weight transformer.decoder.layers.5.linear2.bias transformer.decoder.layers.5.norm3.weight transformer.decoder.layers.5.norm3.bias transformer.decoder.norm.weight transformer.decoder.norm.bias transformer.decoder.ref_point_head.layers.0.weight transformer.decoder.ref_point_head.layers.0.bias transformer.decoder.ref_point_head.layers.1.weight transformer.decoder.ref_point_head.layers.1.bias transformer.decoder.bbox_embed.0.layers.0.weight transformer.decoder.bbox_embed.0.layers.0.bias transformer.decoder.bbox_embed.0.layers.1.weight transformer.decoder.bbox_embed.0.layers.1.bias transformer.decoder.bbox_embed.0.layers.2.weight transformer.decoder.bbox_embed.0.layers.2.bias transformer.decoder.class_embed.0.weight transformer.decoder.class_embed.0.bias transformer.tgt_embed.weight transformer.enc_output.weight transformer.enc_output.bias transformer.enc_output_norm.weight transformer.enc_output_norm.bias transformer.enc_out_bbox_embed.layers.0.weight transformer.enc_out_bbox_embed.layers.0.bias transformer.enc_out_bbox_embed.layers.1.weight transformer.enc_out_bbox_embed.layers.1.bias transformer.enc_out_bbox_embed.layers.2.weight transformer.enc_out_bbox_embed.layers.2.bias transformer.enc_out_class_embed.weight transformer.enc_out_class_embed.bias label_enc.weight input_proj.0.0.weight input_proj.0.0.bias input_proj.0.1.weight input_proj.0.1.bias input_proj.1.0.weight input_proj.1.0.bias input_proj.1.1.weight input_proj.1.1.bias input_proj.2.0.weight input_proj.2.0.bias input_proj.2.1.weight input_proj.2.1.bias input_proj.3.0.weight input_proj.3.0.bias input_proj.3.1.weight input_proj.3.1.bias backbone.0.body.conv1.weight backbone.0.body.layer1.0.conv1.weight backbone.0.body.layer1.0.conv2.weight backbone.0.body.layer1.0.conv3.weight backbone.0.body.layer1.0.downsample.0.weight backbone.0.body.layer1.1.conv1.weight backbone.0.body.layer1.1.conv2.weight backbone.0.body.layer1.1.conv3.weight backbone.0.body.layer1.2.conv1.weight backbone.0.body.layer1.2.conv2.weight backbone.0.body.layer1.2.conv3.weight backbone.0.body.layer2.0.conv1.weight backbone.0.body.layer2.0.conv2.weight backbone.0.body.layer2.0.conv3.weight backbone.0.body.layer2.0.downsample.0.weight backbone.0.body.layer2.1.conv1.weight backbone.0.body.layer2.1.conv2.weight backbone.0.body.layer2.1.conv3.weight backbone.0.body.layer2.2.conv1.weight backbone.0.body.layer2.2.conv2.weight backbone.0.body.layer2.2.conv3.weight backbone.0.body.layer2.3.conv1.weight backbone.0.body.layer2.3.conv2.weight backbone.0.body.layer2.3.conv3.weight backbone.0.body.layer3.0.conv1.weight backbone.0.body.layer3.0.conv2.weight 
backbone.0.body.layer3.0.conv3.weight backbone.0.body.layer3.0.downsample.0.weight backbone.0.body.layer3.1.conv1.weight backbone.0.body.layer3.1.conv2.weight backbone.0.body.layer3.1.conv3.weight backbone.0.body.layer3.2.conv1.weight backbone.0.body.layer3.2.conv2.weight backbone.0.body.layer3.2.conv3.weight backbone.0.body.layer3.3.conv1.weight backbone.0.body.layer3.3.conv2.weight backbone.0.body.layer3.3.conv3.weight backbone.0.body.layer3.4.conv1.weight backbone.0.body.layer3.4.conv2.weight backbone.0.body.layer3.4.conv3.weight backbone.0.body.layer3.5.conv1.weight backbone.0.body.layer3.5.conv2.weight backbone.0.body.layer3.5.conv3.weight backbone.0.body.layer4.0.conv1.weight backbone.0.body.layer4.0.conv2.weight backbone.0.body.layer4.0.conv3.weight backbone.0.body.layer4.0.downsample.0.weight backbone.0.body.layer4.1.conv1.weight backbone.0.body.layer4.1.conv2.weight backbone.0.body.layer4.1.conv3.weight backbone.0.body.layer4.2.conv1.weight backbone.0.body.layer4.2.conv2.weight backbone.0.body.layer4.2.conv3.weight
```

When using the pretrained model to run inference on a set of images, I want to extract embeddings (hidden-layer outputs) that help identify the detected object. To elaborate: if I visualized these embeddings, similar objects should cluster together in the feature space. So I want to extract the embeddings that carry the most identifying features of each detected object in an image.
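For illustration, here is a minimal sketch of the intended downstream use, assuming the per-object embeddings have already been extracted; the file names and shapes here are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: (N, 256) per-object features and (N,) class ids,
# collected beforehand from the model.
embeddings = np.load("object_embeddings.npy")
labels = np.load("object_labels.npy")

# Project to 2D; if the features carry identity information,
# similar objects should form visible clusters.
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=5, cmap="tab20")
plt.title("t-SNE of extracted object embeddings")
plt.show()
```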

So, are input_proj and backbone the layers immediately before the output layer? Are these layers important for my task?

SlongLiu commented 1 year ago

> So, are input_proj and backbone the layers immediately before the output layer? Are these layers important for my task?

I'm afraid not. There are additional Transformer encoder and decoder layers after the backbone. I think these modules are important for model performance; however, I have no idea how much they influence your task.

To extract object features, I recommend using hs[0] at https://github.com/IDEA-Research/DINO/blob/main/models/dino/dino.py#L270. The shape of hs[0] is (bs, num_query, feature_dim), by default (bs, 900, 256), i.e. 900 detected candidates per image (some of which correspond to no object).
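For what it's worth, here is a minimal sketch of one way to capture hs[0] at inference time without editing dino.py, assuming model.transformer returns hs as the first element of its output tuple (as at the linked line); model and samples stand for the already-built model and batched input:

```python
import torch

captured = {}

def grab_decoder_queries(module, inputs, outputs):
    # Assumes the first return value is hs, as at models/dino/dino.py#L270.
    hs = outputs[0]
    captured["queries"] = hs[0]  # (bs, num_query, feature_dim), e.g. (bs, 900, 256)

handle = model.transformer.register_forward_hook(grab_decoder_queries)
with torch.no_grad():
    _ = model(samples)  # a normal inference pass
handle.remove()

object_embeddings = captured["queries"]  # per-query object features
```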

anjugopinath commented 1 year ago

Thank you for your response; it's really helpful. I have a few more questions. Could you answer them, please?

Question 1) This is from the paper:

[image: model architecture figure from the paper]

What I understand is that the backbone comes first and uses a CNN to extract features. But in the output of model.named_parameters(), the backbone parameters (conv1 and layers 1-4) come at the very end.

Are there more backbone layers at the very end of the model?
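(As a side note on ordering, here is a toy PyTorch example, unrelated to DINO, showing that named_parameters() follows module registration order rather than forward-pass order:)

```python
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(8, 2)      # registered first
        self.backbone = nn.Linear(4, 8)  # registered last

    def forward(self, x):
        # The backbone runs first in forward(), yet its parameters
        # print last because it was registered last.
        return self.head(self.backbone(x))

print([name for name, _ in Toy().named_parameters()])
# ['head.weight', 'head.bias', 'backbone.weight', 'backbone.bias']
```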


Question 2)

[image: screenshot of the reply above]

For my second question, I refer to your reply above. When you say "model performance", does that refer only to training efficiency? Because if these modules also improve detection performance, then I would like to take the output from the module that gives the best detection performance (excluding the last layer).


Question 3)

[image: screenshot of the parameter list, with transformer.enc_output boxed in red and transformer.decoder boxed in green]

My third question concerns the other modules (apart from the Transformer encoder and decoder). As shown in the image, there is a layer called transformer.enc_output (red box). At the same time, there is a layer called transformer.encoder, which appears before transformer.decoder (green box); I am not including the image starting from transformer.encoder here since it would be too large.

How is the transformer.enc_output layer different from transformer.encoder? Is it actually the output of the transformer.encoder layer, with the order simply scrambled in the output of model.named_parameters()?
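(One generic way to check this directly, rather than guessing from parameter names, is to print the module objects themselves; this is plain PyTorch, not a DINO-specific API:)

```python
# Print the actual module objects behind the two names.
for name, module in model.named_modules():
    if name in ("transformer.encoder", "transformer.enc_output"):
        print(f"{name}: {module}")

# Judging from its parameters (a single weight/bias pair),
# transformer.enc_output looks like a single nn.Linear, whereas
# transformer.encoder is the stack of encoder layers listed earlier.
```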


Question 4) Finally, what is the input_proj layer? Is it part of the positional-encoding calculation from the original Transformer paper?