aharley / simple_bev

A Simple Baseline for BEV Perception
MIT License
502 stars · 79 forks

Questions regarding backbone network #26

Closed: henriquepm closed this issue 1 year ago

henriquepm commented 1 year ago

Hi! First of all thank you for the great quality of this work, both the paper and the code. I have a couple of doubts regarding the backbone:

  1. As mentioned in issue #24, the image features in the repo come from concatenating the output of the second layer with the upsampled output of the third layer. The paper instead states that the features come from concatenating the output of the third layer with the upsampled output of the last layer, leading to feature maps of dimension C x H/8 x W/8, while the approach in the code would produce feature maps of dimension C x H/4 x W/4. Which of the two approaches produced the results reported in the paper? Does this difference have a significant effect on performance (if both have been tested)?
  2. The paper mentions that the ResNet-101 backbone is initialized from COCO pretraining, citing the DETR paper, while in the code the network is initialized from torchvision's default weights (ImageNet pretraining). In the experiments section of the paper, the effect of input resolution is discussed, and it is hypothesised that the decreasing performance at higher resolutions could be explained by worse transfer due to the mismatch with the pretraining scale. Do the results in that section come from the approach described in the paper (COCO pretraining) or from the one in the code? In case you have run experiments with both approaches, does this make any significant difference? Thanks again.
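For readers following along, the two variants in question 1 can be compared purely in terms of spatial-stride arithmetic. This is a hedged sketch, not the repo's actual code: stage names and strides follow a standard torchvision ResNet (conv1 and maxpool are each stride 2, then layer1..layer4 have output strides 4, 8, 16, 32), and `fused_resolution` is an illustrative helper.

```python
# Output strides of a standard torchvision ResNet's stages,
# relative to the network input.
STRIDE = {"layer1": 4, "layer2": 8, "layer3": 16, "layer4": 32}

def fused_resolution(h, w, keep_stage, upsample_stage):
    """Spatial size after 2x-upsampling `upsample_stage` and
    concatenating it with `keep_stage` (the coarser map is brought
    to the finer grid, so the finer stride wins)."""
    s = STRIDE[keep_stage]
    assert STRIDE[upsample_stage] == 2 * s, "stages must be one octave apart"
    return h // s, w // s

H, W = 448, 800  # input resolution used later in this thread

# Variant in the code: layer2 + upsampled layer3
print(fused_resolution(H, W, "layer2", "layer3"))  # (56, 100) = H/8 x W/8

# Variant as worded in the paper: layer3 + upsampled layer4
print(fused_resolution(H, W, "layer3", "layer4"))  # (28, 50) = H/16 x W/16
```

Note that under torchvision's stage naming, layer2 already has output stride 8, which is relevant to how the question gets resolved below.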
aharley commented 1 year ago

Thanks for these questions.

  1. The paper results come from this repo (or a slightly messier version of it). I will either update the dimension line in the paper, or add an experiment with H/8 x W/8. (Do you already know if H/8 x W/8 is much different?)
  2. That's a great point. I need to check back and see why the paper says COCO while the code clearly indicates ImageNet. It could be that we used COCO inits very early on, then switched to ImageNet while simplifying the codebase.
henriquepm commented 1 year ago

Thanks for the quick answer. I don't know at the moment; I'm planning to run some experiments with the backbones and wanted to understand the starting point as well as possible.

aharley commented 1 year ago

@henriquepm I'm coming back to this to check the /4 and /8 stuff. I added a bunch of shape prints to the forward of Encoder_res101, and right now I'm not sure why you said "the approach in the code will produce FM of dimension C x H/4 x W/4".

def forward(self, x):
    print('x in', x.shape)
    x1 = self.backbone(x)
    print('x1', x1.shape)
    x2 = self.layer3(x1)
    print('x2', x2.shape)
    x = self.upsampling_layer(x2, x1)
    print('x up', x.shape)
    x = self.depth_layer(x)
    print('x d', x.shape)
    return x

The output is:

x in torch.Size([6, 3, 448, 800])
x1 torch.Size([6, 512, 56, 100])
x2 torch.Size([6, 1024, 28, 50])
x up torch.Size([6, 512, 56, 100])
x d torch.Size([6, 128, 56, 100])

which looks like H/8 x W/8, as the paper said. I may easily have missed something because I haven't used the repo in a while, so please let me know if you see something wrong.
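The printed shapes can be sanity-checked by dividing the input spatial size by each feature size, which recovers the per-stage strides. A minimal check, with the shapes copied from the printout above (the per-line labels in the comments are my reading of them):

```python
# (H, W) per print statement in the forward pass above.
shapes = {
    "x in": (448, 800),  # network input
    "x1":   (56, 100),   # self.backbone output
    "x2":   (28, 50),    # self.layer3 output
    "x up": (56, 100),   # after upsampling + concatenation
    "x d":  (56, 100),   # after the depth layer
}

H, W = shapes["x in"]
for name, (h, w) in shapes.items():
    print(f"{name}: stride {H // h} (H), {W // w} (W)")
```

The final features sit at stride 8 in both dimensions, i.e. H/8 x W/8.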

henriquepm commented 1 year ago

Hey, that looks totally right; sorry about that. I took another look at the notebook where I was dissecting the network: I was comparing the size against the output of the ResNet's first conv layer instead of the actual input, so I was missing a factor of 1/2.
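The mix-up described above is easy to reproduce with the numbers from this thread: conv1 in a standard ResNet has stride 2, so measuring the feature map against conv1's output (instead of the network input) halves the apparent stride. A small illustrative sketch:

```python
# Input resolution and final feature size from the printout above.
H_in, W_in = 448, 800
H_feat, W_feat = 56, 100

# conv1 has stride 2, so its output is already at half resolution.
H_conv1, W_conv1 = H_in // 2, W_in // 2

print("stride vs input:", H_in // H_feat, W_in // W_feat)        # 8 8
print("stride vs conv1:", H_conv1 // H_feat, W_conv1 // W_feat)  # 4 4
```

Measured against conv1 the features look like H/4 x W/4, but relative to the true input they are H/8 x W/8, matching the paper.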

aharley commented 1 year ago

Perfect, no problem. Thanks for confirming so quickly!