First of all, thank you for sharing your great work with the community :)
According to your published paper, the camera location embedding (tau_{k}) is subtracted from the map-view positional encoding (c^{n}) to form the map-view query (c^{n} - tau_{k}).
However, in your code the camera location embedding (tau_{k}) is also subtracted from the camera positional embedding (delta_{k,i}), which does not match equation 3. Please see the last two lines of the following code.
# -------------------------
# translation embedding, tau_{k}
# -------------------------
c = E_inv[..., -1:] # b n 4 1
c_flat = rearrange(c, 'b n ... -> (b n) ...')[..., None] # (b n) 4 1 1
c_embed = self.cam_embed(c_flat) # (b n) d 1 1
# -------------------------
# R_{k}^{-1} X K_{k}^{-1} X x_{i}^{(I)}
# -------------------------
pixel_flat = rearrange(pixel, '... h w -> ... (h w)') # 1 1 3 (h w)
cam = I_inv @ pixel_flat # b n 3 (h w)
cam = F.pad(cam, (0, 0, 0, 1, 0, 0, 0, 0), value=1) # b n 4 (h w)
d = E_inv @ cam # b n 4 (h w)
d_flat = rearrange(d, 'b n d (h w) -> (b n) d h w', h=h, w=w) # (b n) 4 h w
d_embed = self.img_embed(d_flat)
# -------------------------
# Normalization for attention
# -------------------------
# TODO : why subtract c_embed?
img_embed = d_embed - c_embed # (b n) d h w
img_embed = img_embed / (img_embed.norm(dim=1, keepdim=True) + 1e-7) # (b n) d h w
Am I missing something?
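One possible reading, which I have not confirmed: because `cam` is padded with 1 before being multiplied by `E_inv`, `d` contains the camera translation in addition to the direction R_{k}^{-1} K_{k}^{-1} x_{i}. In raw coordinates, subtracting the last column of `E_inv` would cancel that translation. The NumPy sketch below checks this for a single pixel; the camera values (`theta`, `t`, `K`, `pixel`) are made up for illustration, and it says nothing about whether the same cancellation holds after `img_embed` and `cam_embed`, which would require both to be linear, bias-free, and to share weights.

```python
import numpy as np

# Hypothetical camera pose (illustrative values, not from the repository).
theta = np.deg2rad(30.0)
R_inv = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t = np.array([1.5, -0.5, 2.0])  # camera center in the world frame

# Inverse extrinsics (camera -> world) as a 4x4 homogeneous matrix.
E_inv = np.eye(4)
E_inv[:3, :3] = R_inv
E_inv[:3, 3] = t

# Hypothetical intrinsics and one homogeneous pixel coordinate.
K = np.array([
    [500.0,   0.0, 320.0],
    [  0.0, 500.0, 240.0],
    [  0.0,   0.0,   1.0],
])
K_inv = np.linalg.inv(K)
pixel = np.array([100.0, 50.0, 1.0])

cam = np.append(K_inv @ pixel, 1.0)  # mirrors F.pad(cam, ..., value=1)
d = E_inv @ cam                      # mirrors d = E_inv @ cam
c = E_inv[:, -1]                     # mirrors c = E_inv[..., -1:]

# Pure direction term, R_k^{-1} K_k^{-1} x_i, without any translation.
direction = R_inv @ K_inv @ pixel

# In raw coordinates, d - c removes the translation and leaves the direction.
residual = d - c
print(np.allclose(residual[:3], direction), residual[3])
```

If this reading is right, `d_embed - c_embed` might be intended to remove the translation that `d` picks up from the homogeneous padding, but I would appreciate confirmation.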