facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

what is the output size of transformer encoder and decoder? #560

Open ericosmic opened 1 year ago

ericosmic commented 1 year ago

I am trying to analyse the output sizes of the transformer encoder and decoder, and I use register_forward_hook (following the instructions in https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/detr_attention.ipynb) to get the output of the last layer of the encoder and decoder, with a function like the one below

[image: screenshot of the hook function]

If the input size is (1, 3, 800, 800), the result shows that the encoder and decoder output sizes are (1, 625, 625) and (1, 100, 625), respectively.

[image: printed output shapes]
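For reference, here is a minimal sketch of that kind of hook setup, assuming (as in the linked detr_attention.ipynb notebook) that the hooks are registered on the attention modules of the last encoder and decoder layers of the torch.hub detr_resnet50 model; the shapes in the comments are for an 800x800 input:

import torch

# load DETR from torch hub (same model as in the notebook)
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True).eval()

enc_attn_weights, dec_attn_weights = [], []
hooks = [
    # nn.MultiheadAttention returns (attn_output, attn_weights); keeping output[1]
    # captures the attention *weights*, not the layer's hidden states
    model.transformer.encoder.layers[-1].self_attn.register_forward_hook(
        lambda module, inputs, output: enc_attn_weights.append(output[1])),
    model.transformer.decoder.layers[-1].multihead_attn.register_forward_hook(
        lambda module, inputs, output: dec_attn_weights.append(output[1])),
]

with torch.no_grad():
    model(torch.zeros(1, 3, 800, 800))  # dummy 800x800 input

for hook in hooks:
    hook.remove()

# an 800x800 input gives a 25x25 = 625-position feature map after the stride-32 backbone
print(enc_attn_weights[0].shape)  # torch.Size([1, 625, 625]): encoder self-attention weights
print(dec_attn_weights[0].shape)  # torch.Size([1, 100, 625]): decoder cross-attention weights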

But if I use torchinfo.summary to inspect the output size of each layer in the transformer with the same input size, the output sizes of the last layers of the encoder and decoder are (625, batch_size, 256) and (num_layers, 100, batch_size, 256), i.e.

[image: torchinfo.summary output]
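And a sketch of the torchinfo check, assuming the same torch.hub detr_resnet50 model (which accepts a plain (N, 3, H, W) tensor):

import torch
from torchinfo import summary

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True).eval()

# torchinfo reports the real module outputs:
# - the DETR transformer works sequence-first, so the last encoder layer shows
#   (625, batch_size, 256) = (H/32 * W/32, batch_size, d_model)
# - the decoder returns the stacked intermediate outputs of all its layers
#   (return_intermediate_dec=True), hence the leading num_layers dimension
summary(model, input_size=(1, 3, 800, 800), depth=4)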

I think the key point is that the encoder output sizes are different. What causes this difference? Does anyone know?

NielsRogge commented 1 year ago

Hi,

It's a bit easier to test this using the HuggingFace implementation of DETR. You can just pass output_hidden_states=True to the forward of the model and verify the shapes of the encoder and decoder hidden states.

Let's illustrate with an example. Let's first instantiate the model and its processor:

from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

Next, let's prepare an image for the model:

import requests
from PIL import Image

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt")
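A small aside that helps interpret the numbers below: the processor resizes the image (by default so that the shortest side is 800 pixels) and returns pixel_values (plus, depending on the settings, a pixel_mask). The resized size is what determines the encoder sequence length. The exact width may vary slightly with the processor version, but roughly:

print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 800, 1066]) for this 640x480 COCO image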

We can now forward the inputs through the model:

import torch

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

The outputs object is a dictionary containing the class logits and predicted bounding boxes, but also the intermediate activations, since we set output_hidden_states=True. Let's check the shape of the final encoder hidden states:

outputs['encoder_last_hidden_state'].shape

In this case, it prints torch.Size([1, 850, 256]). This is because the image first gets forwarded through a ResNet backbone, which outputs a 2D feature map. Next, this feature map gets flattened into one sequence that is sent through the Transformer encoder.
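As a rough sanity check (assuming a stride-32 ResNet backbone, so each spatial side of the feature map is the ceiling of the corresponding image side divided by 32), 850 is simply the number of feature-map positions:

import math

# resized image size produced by the processor, e.g. 800 x 1066 for this image
height, width = inputs["pixel_values"].shape[-2:]
print(math.ceil(height / 32) * math.ceil(width / 32))  # 25 * 34 = 850 feature-map positions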

Let's check the Transformer decoder's last hidden state:

outputs['last_hidden_state'].shape

In this case, it prints torch.Size([1, 100, 256]). This is because the decoder of DETR uses 100 so-called object queries to detect at most 100 objects in the image. I'd recommend reading the DETR paper to see why these object queries are defined.
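The 100 is just the num_queries hyperparameter of the model, and every query yields one class prediction and one box:

print(model.config.num_queries)     # 100 object queries
print(outputs['logits'].shape)      # torch.Size([1, 100, 92]): per query, class logits (91 COCO labels + "no object" for this checkpoint)
print(outputs['pred_boxes'].shape)  # torch.Size([1, 100, 4]):  per query, a normalized (cx, cy, w, h) box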

ericosmic commented 1 year ago

@NielsRogge In your answer, the output sizes of the encoder and decoder match the torchinfo.summary result in my post. But I still want to know why they are different from the output sizes obtained with register_forward_hook. In my understanding of DETR, the output of the decoder could be reshaped to the height and width of the conv backbone's output, i.e. (batch_size, 100, height, width), just like the operation in https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/detr_attention.ipynb. So that would mean the decoder output size should be (batch_size, 100, height*width).
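For context, a compact sketch of what the linked notebook reshapes to (h, w), assuming (as in detr_attention.ipynb) a hook on the last decoder layer's cross-attention module; the reshaped tensor there is the cross-attention weights, one map per query:

import torch

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True).eval()

dec_attn_weights = []
hook = model.transformer.decoder.layers[-1].multihead_attn.register_forward_hook(
    lambda module, inputs, output: dec_attn_weights.append(output[1]))
with torch.no_grad():
    model(torch.zeros(1, 3, 800, 800))
hook.remove()

h = w = 800 // 32                           # 25x25 backbone feature map for an 800x800 input
weights = dec_attn_weights[0]               # torch.Size([1, 100, 625])
attn_map_query0 = weights[0, 0].view(h, w)  # per-query (h, w) attention map, as visualized in the notebook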

MLDeS commented 11 months ago

with torch.no_grad(): outputs = model(**inputs, output_hidden_states=True)

@NielsRogge Thanks for the example.

I looked at the Hugging Face DETR repo for the outputs and attention visualization, and I see that there are two parameters I might want to visualize: 1) decoder_attentions -> attention weights of the decoder, after the attention softmax, and 2) cross_attentions -> attention weights of the decoder's cross-attention layer, after the attention softmax. My question is: is the former then the self-attention weights?

Also, I do not see these different parameters in other repos, e.g., Blip2. There I only see attentions and cross_attentions. How could I visualize the decoder_attentions similar to DETR? Could you please guide me? I know this repo is only for DETR, and I would be OK taking this conversation somewhere else if it extends beyond this.

Thanks a lot!
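For what it's worth, here is a minimal sketch that requests both attention outputs from the Hugging Face DETR model (reusing the model and inputs from the example earlier in this thread); the shapes themselves indicate which attention each one is:

import torch

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# each is a tuple with one tensor per layer
print(outputs['decoder_attentions'][-1].shape)  # (batch_size, num_heads, 100, 100): decoder self-attention (queries over queries)
print(outputs['cross_attentions'][-1].shape)    # (batch_size, num_heads, 100, 850): decoder cross-attention (queries over encoder positions)
print(outputs['encoder_attentions'][-1].shape)  # (batch_size, num_heads, 850, 850): encoder self-attention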