bradyz / cross_view_transformers

Cross-view Transformers for real-time Map-view Semantic Segmentation (CVPR 2022 Oral)
MIT License

Question about the implementation of the 'camera-aware positional encoding' #45

d1024choi opened this issue 1 year ago

Thank you for sharing your great work with the community :)

According to your published paper, the camera-location embeddings $\tau_k$ are subtracted from the map-view positional embeddings $c^{(n)}$ to form the map-view queries $(c^{(n)} - \tau_k)$.
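
My reading of that similarity (a paraphrase of Equation 3 using the notation above; the exact indexing may differ from the paper's) is a cosine comparison between the image-side key $\delta_{k,i}$ and the translated map-view query:

$$\mathrm{sim}^{(n)}_{k,i} = \frac{\delta_{k,i} \cdot \left(c^{(n)} - \tau_k\right)}{\lVert \delta_{k,i} \rVert \, \lVert c^{(n)} - \tau_k \rVert}$$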

However, I found in your code that the camera-location embeddings $\tau_k$ are also subtracted from the image positional embeddings $\delta_{k,i}$, which differs from Equation 3. Please see the last two lines of the following snippet.

```python
# -------------------------
# translation embedding, tau_k
# -------------------------
c = E_inv[..., -1:]                                                     # b n 4 1
c_flat = rearrange(c, 'b n ... -> (b n) ...')[..., None]                # (b n) 4 1 1
c_embed = self.cam_embed(c_flat)                                        # (b n) d 1 1

# -------------------------
# R_k^{-1} @ K_k^{-1} @ x_i^{(I)}
# -------------------------
pixel_flat = rearrange(pixel, '... h w -> ... (h w)')                   # 1 1 3 (h w)
cam = I_inv @ pixel_flat                                                # b n 3 (h w)
cam = F.pad(cam, (0, 0, 0, 1, 0, 0, 0, 0), value=1)                     # b n 4 (h w)
d = E_inv @ cam                                                         # b n 4 (h w)
d_flat = rearrange(d, 'b n d (h w) -> (b n) d h w', h=h, w=w)           # (b n) 4 h w
d_embed = self.img_embed(d_flat)                                        # (b n) d h w

# -------------------------
# Normalization for attention
# -------------------------
# TODO: why subtract c_embed?
img_embed = d_embed - c_embed                                           # (b n) d h w
img_embed = img_embed / (img_embed.norm(dim=1, keepdim=True) + 1e-7)    # (b n) d h w
```
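
For reference, here is a minimal self-contained version of the snippet above that runs with dummy tensors. The `Conv2d` embedders, tensor sizes, and identity camera matrices are my assumptions, standing in for `self.cam_embed` / `self.img_embed` and the real calibration:

```python
import torch
import torch.nn.functional as F
from einops import rearrange

b, n, h, w, dim = 2, 6, 28, 60, 128                       # assumed batch/camera/feature sizes

# stand-ins for the repo's learned embedders (assumptions, not the real modules)
cam_embed = torch.nn.Conv2d(4, dim, 1, bias=False)        # ~ self.cam_embed
img_embed = torch.nn.Conv2d(4, dim, 1, bias=False)        # ~ self.img_embed

E_inv = torch.eye(4).expand(b, n, 4, 4)                   # camera-to-world extrinsics, b n 4 4
I_inv = torch.eye(3).expand(b, n, 3, 3)                   # inverse intrinsics, b n 3 3

# homogeneous pixel coordinates x_i^{(I)}: 1 1 3 h w
ii, jj = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
pixel = torch.stack([jj, ii, torch.ones_like(ii)]).float()[None, None]

# translation embedding tau_k from the camera center (last column of E_inv)
c = E_inv[..., -1:]                                       # b n 4 1
c_flat = rearrange(c, 'b n ... -> (b n) ...')[..., None]  # (b n) 4 1 1
c_embed = cam_embed(c_flat)                               # (b n) d 1 1

# unproject every pixel to a world-frame direction: E_inv @ (I_inv @ x_i)
pixel_flat = rearrange(pixel, '... h w -> ... (h w)')     # 1 1 3 (h w)
cam = I_inv @ pixel_flat                                  # b n 3 (h w)
cam = F.pad(cam, (0, 0, 0, 1, 0, 0, 0, 0), value=1)       # b n 4 (h w), homogeneous
d = E_inv @ cam                                           # b n 4 (h w)
d_flat = rearrange(d, 'b n d (h w) -> (b n) d h w', h=h, w=w)
d_embed = img_embed(d_flat)                               # (b n) d h w

# the two lines in question: subtract tau_k, then L2-normalize over channels
key_pos = d_embed - c_embed
key_pos = key_pos / (key_pos.norm(dim=1, keepdim=True) + 1e-7)
print(key_pos.shape)                                      # torch.Size([12, 128, 28, 60])
```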

Am I missing something?