Question regarding feature streams

Hello, in your paper you describe the following:

For each feature map F_i ∈ {F_g, F_s, F_c} representing the geometric feature, the DINOV2 feature, and the respective RGB value, we employ a separate ResNet model L_i as feature extractor

However, in your code, I see the following instead:

https://github.com/NOrangeeroli/SecondPose/blob/89725402284c3478a217bf7c8806985f58aab287/model/VI_Net_geodino.py#L362-L367

Specifically, as I understand it, this is the input to the 'Dual-stream fusion' step:

        x = self.spherical_fpn(dis_map, torch.cat([rgb_map, ref_map],dim = 1) , ppf_map)  # each input feature passes through a separate ResNet here

where

dis_map : radial distance features in spherical map representation
torch.cat([rgb_map, ref_map],dim = 1) : RGB features and DINOv2 features in spherical map representation
ppf_map : geometric point-pair features in spherical map representation

I have two questions regarding this:

1. Is the above understanding correct?
2. How do these inputs correspond to the features described in the paper? Specifically, which feature does dis_map correspond to? I would expect rgb_map = F_c, ref_map = F_s, and ppf_map = F_g. In the code, however, rgb_map and ref_map are concatenated and treated as a single feature rather than two, and dis_map is passed as an input even though the paper does not describe it as an input feature.
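
To make the second question concrete, here is a minimal shape-only sketch of the grouping I think I am seeing. It does not call the SecondPose code. The helper cat_shape is hypothetical and only mirrors the shape rule of torch.cat. The batch size, map resolution, and channel counts are made-up values for illustration, not the values used in the repository.

```python
def cat_shape(shapes, dim=1):
    """Return the shape torch.cat would give for tensors of these shapes.

    Hypothetical helper for illustration; assumes a non-negative `dim`.
    """
    first = shapes[0]
    for shape in shapes[1:]:
        # Every dimension except `dim` has to match for a concatenation.
        same_rank = len(shape) == len(first)
        if not same_rank or any(
            a != b for i, (a, b) in enumerate(zip(first, shape)) if i != dim
        ):
            raise ValueError(f"cannot concatenate {first} and {shape} on dim {dim}")
    out = list(first)
    out[dim] = sum(shape[dim] for shape in shapes)
    return tuple(out)


# Assumed (batch, channels, height, width) shapes; all numbers are made up.
B, H, W = 2, 64, 64
dis_map = (B, 1, H, W)    # radial distance
rgb_map = (B, 3, H, W)    # RGB values, F_c in the paper (my guess)
ref_map = (B, 128, H, W)  # DINOv2 features, F_s in the paper (my guess)
ppf_map = (B, 4, H, W)    # point-pair features, F_g in the paper (my guess)

# Same argument grouping as the spherical_fpn call quoted above.
fpn_inputs = [dis_map, cat_shape([rgb_map, ref_map], dim=1), ppf_map]

print(len(fpn_inputs))  # 3
print(fpn_inputs[1])    # (2, 131, 64, 64)
```

So the call still receives three inputs, but they are grouped as {dis_map}, {rgb_map + ref_map}, {ppf_map} rather than {F_g}, {F_s}, {F_c} as I read the paper.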

Thanks for your time!
