bradyz / cross_view_transformers

Cross-view Transformers for real-time Map-view Semantic Segmentation (CVPR 2022 Oral)
MIT License

image embedding calculation #13

Closed by DerrickXuNu 2 years ago

DerrickXuNu commented 2 years ago

Hey Brady,

In your encoder.py line 255, you used:

img_embed = d_embed - c_embed    

To my understanding, here you want to subtract the camera translation information from the image coordinate embedding. However, I think the translation information is already included in the image coordinate embedding:

  pixel_flat = rearrange(pixel, '... h w -> ... (h w)')                   # 1 1 3 (h w)
  cam = I_inv @ pixel_flat                                                # b n 3 (h w)
  cam = F.pad(cam, (0, 0, 0, 1, 0, 0, 0, 0), value=1)                     # b n 4 (h w)
  d = E_inv @ cam                                                         # b n 4 (h w)
  d_flat = rearrange(d, 'b n d (h w) -> (b n) d h w', h=h, w=w)           # (b n) 4 h w
  d_embed = self.img_embed(d_flat)   

where E_inv already contains the translation. So isn't subtracting c_embed redundant?

bradyz commented 2 years ago

You are correct that E_inv already contains the translation, so $d$ is in the coordinate system of the ego vehicle.

But we actually want the image embedding to be in the reference frame of the particular camera - which is why we subtract c_embed. This is also why we subtract the same c_embed when computing the BEV embedding.
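To make the geometry concrete, here is a minimal NumPy sketch of the same idea in raw coordinates rather than learned embeddings. It is not the repository's code: the helper `ray_directions`, the intrinsics `K`, the rotation `R`, and the translations are all hypothetical, and it assumes the extrinsic matrix E = [[R, t], [0, 1]] maps ego coordinates to camera coordinates. In the model the subtraction happens after the learned embedding layers, so this only illustrates the intuition.

```python
import numpy as np


def ray_directions(pixel, K, R, t):
    """Unproject homogeneous pixels (3 x N) and subtract the camera center.

    Assumes E = [[R, t], [0, 1]] maps ego coordinates to camera coordinates,
    so E_inv maps camera coordinates back to the ego frame.
    """
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = t
    E_inv = np.linalg.inv(E)

    cam = np.linalg.inv(K) @ pixel                      # 3 x N, camera frame
    cam = np.vstack([cam, np.ones((1, cam.shape[1]))])  # 4 x N, homogeneous
    d = E_inv @ cam                                     # 4 x N, ego frame
    c = E_inv[:, -1:]                                   # 4 x 1, camera center in ego frame
    return d - c                                        # 4 x N, last row is 0


# Hypothetical camera: simple pinhole intrinsics, 30 degree yaw.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta), np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])

# Three homogeneous pixels, one per column.
pixel = np.array([[0.0, 50.0, 100.0],
                  [0.0, 50.0, 100.0],
                  [1.0, 1.0, 1.0]])

# Same camera orientation, two different mounting positions.
rays_a = ray_directions(pixel, K, R, np.array([1.0, 2.0, 0.5]))
rays_b = ray_directions(pixel, K, R, np.array([-4.0, 0.0, 3.0]))

print(np.allclose(rays_a, rays_b))  # the translation cancels out
```

Under these assumptions, d carries the camera position, while d - c is a pure direction (its homogeneous coordinate is 0) that depends only on the intrinsics and the rotation.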

BHC1205 commented 1 year ago

> But we actually want the image embedding to be in the reference frame of the particular camera - which is why we subtract c_embed. This is also why we subtract the same c_embed when computing the BEV embedding.

Hi, I still have some doubts about this. Why is only the translation $t_k$ considered, while the rotation $R_k$ is ignored?