ArminMasoumian / GCNDepth

Self-Supervised CNN-GCN Autoencoder for Monocular Depth Estimation

Some confusion about the code you published #16

Closed · takisu0916 closed this issue 1 year ago

takisu0916 commented 1 year ago

I tried deploying your code in our application, which uses a custom dataset with a batch size of 4. However, when I read `def forward(self, input_features, frame_id=0)` in depth_decoder.py, I ran into some confusion (relevant code shown below; the `#` comments give the shape of each feature map).

```python
...
x4 = self.merge4(x4)        # (3x3 conv)  --> [4, 256, 10, 8]
x4 = F.leaky_relu(x4)       # (leaky_relu)
y4 = x4.view(32 * 10, -1)   #             --> [320, 256]
y4 = self.gc1(y4, adj)      #             --> [320, 320]
y3 = self.gc2(y4, adj)      #             --> [320, 1]
y4 = y3.view(1, 1, 10, 32)  #             --> [1, 1, 10, 32]
y4 = self.do(y4)            # (presumably dropout)
disp4 = upsample(y4)        #             --> [1, 1, 20, 64]
x4 = upsample(x4)           #             --> [4, 256, 20, 16]
...
x3 = torch.cat((x3, x4, disp4), 1)
```
Q: Why can we concatenate two feature maps with different shapes here, i.e., x4 (shape [4, 256, 20, 16]) and disp4 (shape [1, 1, 20, 64])?

If I have misunderstood something, please help me clarify; I would be extremely grateful.
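
For reference, `torch.cat` along dim=1 requires all other dimensions to match, so a direct concatenation of these shapes should fail. A minimal repro with random placeholder tensors in place of the real feature maps:

```python
import torch

# torch.cat along dim=1 requires all non-concatenated dimensions to match,
# so tensors with the shapes printed above cannot be joined directly.
x4 = torch.randn(4, 256, 20, 16)   # placeholder for the upsampled x4
disp4 = torch.randn(1, 1, 20, 64)  # placeholder for the upsampled disp4

try:
    torch.cat((x4, disp4), 1)
except RuntimeError as e:
    print(e)  # sizes of tensors must match except in dimension 1
```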

ArminMasoumian commented 1 year ago


Our decoder's input sizes were designed to match the outputs of our encoder (ResNet-50) for the KITTI dataset, which uses an image resolution of 1024 x 320. If you intend to use our model on your own custom dataset, you will need to print the output of each layer in the encoder and adjust the decoder's input sizes accordingly; in particular, the hard-coded reshapes in the decoder (such as `view(32 * 10, -1)`) assume KITTI's feature-map sizes. Essentially, the encoder outputs must have the same dimensions as the decoder inputs.
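
A quick way to do that shape check is to register forward hooks and print each stage's output. The sketch below uses torchvision's ResNet-50 as a stand-in for the repo's encoder; the encoder class and the 640 x 256 input resolution are illustrative assumptions, not the repo's actual code:

```python
import torch
from torchvision.models import resnet50

# Sketch only: torchvision's ResNet-50 stands in for the repo's encoder.
# Replace the model and the input resolution with your own; the point is
# to print each stage's output shape so the decoder can be resized to match.
model = resnet50(weights=None).eval()

shapes = {}
for name in ["relu", "layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(
        lambda mod, inp, out, name=name: shapes.update({name: tuple(out.shape)})
    )

# Example custom input: batch size 4, 640 x 256 instead of KITTI's 1024 x 320.
x = torch.randn(4, 3, 256, 640)
with torch.no_grad():
    model(x)

for name, shape in shapes.items():
    print(f"{name}: {shape}")
# layer4 here gives (4, 2048, 8, 20); the decoder's hard-coded sizes
# (e.g. the view(32 * 10, -1) reshape) must be updated to match.
```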