TencentARC / ViT-Lens

[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
https://ailab-cvc.github.io/seed/vitlens/

SUN RGB-D is not in millimeters #12

Open jbrownkramer opened 4 months ago

jbrownkramer commented 4 months ago

I was trying to apply this model to my own data and was not getting good results. I ran the NYUv2 dataset through my code, and those results were in line with the ones reported in the ViT-Lens paper.

Digging into it, the issue is - at least partly - that the NYUv2 depth data (as distributed with SUN RGB-D) is not in millimeters. Here is the MATLAB code from the SUNRGBDtoolbox (https://rgbd.cs.princeton.edu/) for converting the PNG files to mm:

depthVis = imread(data.depthpath);                                     % raw 16-bit PNG depth values
imsize = size(depthVis);
depthInpaint = bitor(bitshift(depthVis,-3), bitshift(depthVis,16-3));  % circular right shift by 3 bits -> depth in mm

In other words, the data in the PNG files is the depth in mm circularly shifted left by 3 bits (which for most data is just a multiplication by 8).
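
For anyone doing this in Python, here is a rough NumPy equivalent of that conversion (my own sketch, not code from either repo; the function name is just illustrative):

import numpy as np
from PIL import Image

def sunrgbd_png_to_mm(depth_path):
    # Load the 16-bit depth PNG as stored in SUN RGB-D.
    depth_vis = np.asarray(Image.open(depth_path)).astype(np.uint16)
    # Undo the 3-bit circular left shift: bits 3..15 move down and the low
    # 3 bits wrap around to the top of the 16-bit value, giving depth in mm.
    depth_mm = ((depth_vis >> 3) | (depth_vis << 13)).astype(np.uint16)
    return depth_mm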

I mention this because the code in https://github.com/TencentARC/ViT-Lens/issues/9 seems to assume that the data is in mm. This could matter if other datasets that store depth in mm, rather than in the SUN RGB-D format, get used.

StanLei52 commented 4 months ago

Thank you for pointing this out -- it is important to get this right for a more general depth model. Could you please also check LanguageBind and their uploaded NYU-D? If their pipeline works on your own data, I will look into adopting their preprocessing instead of following ImageBind.

jbrownkramer commented 4 months ago

I will look into LanguageBind.

I will say this: I updated the processing in my pipeline to match the circular shift, quantization, and camera intrinsics of the NYU data. The results on our data are still not very good. My suspicion is that SUN RGB-D has no people in it, and the text labels I am trying to match describe the locations of people in the scene.

jbrownkramer commented 4 months ago

Below is the transformation pipeline in LanguageBind. The starting format is depth in mm (NOT DISPARITY). I ran their inference example from the repo homepage, and max_depth is configured to 10. So in summary: read the data in mm, convert to meters, clamp between 0.01 and 10 meters, divide by 10 meters, resize and center crop to 224, and normalize with OPENAI_DATASET_MEAN and OPENAI_DATASET_STD.

I tried running LanguageBind directly on the SUN RGB-D versions of the NYUv2 data and it gave bad outputs. When I applied the circular shift (to put the data back into mm) it gave good results, so they must be doing some preprocessing to convert the NYU data to mm first.

# Imports and the CLIP normalization constants are added here so the snippet is self-contained.
import numpy as np
import torch
import torch.nn as nn
from torchvision import transforms

# OpenAI CLIP image normalization statistics
OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)


class DepthNorm(nn.Module):
    def __init__(
        self,
        max_depth=0,
        min_depth=0.01,
    ):
        super().__init__()
        self.max_depth = max_depth
        self.min_depth = min_depth
        self.scale = 1000.0  # nyuv2 abs.depth: input is depth in mm

    def forward(self, image):
        # image = np.array(image)
        depth_img = image / self.scale  # (H, W) in meters
        depth_img = depth_img.clip(min=self.min_depth)
        if self.max_depth != 0:
            depth_img = depth_img.clip(max=self.max_depth)
            depth_img /= self.max_depth  # scale to 0-1
        else:
            depth_img /= depth_img.max()
        depth_img = torch.from_numpy(depth_img).unsqueeze(0).repeat(3, 1, 1)  # treat depth as a 3-channel image
        return depth_img.to(torch.get_default_dtype())


def get_depth_transform(config):
    config = config.vision_config
    transform = transforms.Compose(
        [
            DepthNorm(max_depth=config.max_depth),
            transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
            transforms.CenterCrop(224),
            transforms.Normalize(OPENAI_DATASET_MEAN, OPENAI_DATASET_STD),  # normalize with CLIP image stats
            # transforms.Normalize((0.5, ), (0.5, ))  # 0-1 to norm distribution
            # transforms.Normalize((0.0418, ), (0.0295, ))  # sun rgb-d  imagebind
            # transforms.Normalize((0.02, ), (0.00295, ))  # nyuv2
        ]
    )
    return transform
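
To tie this back to the format issue above, here is a rough usage sketch of my own (not from LanguageBind): decode a SUN RGB-D style PNG back to mm first, then apply the same transform. The file path is a placeholder, and max_depth=10 matches their inference example.

from PIL import Image

# Placeholder path to a SUN RGB-D / NYUv2 depth PNG (3-bit circularly shifted format).
depth_vis = np.asarray(Image.open("depth.png")).astype(np.uint16)
# Undo the circular shift to recover depth in mm.
depth_mm = ((depth_vis >> 3) | (depth_vis << 13)).astype(np.uint16)

depth_transform = transforms.Compose(
    [
        DepthNorm(max_depth=10),  # LanguageBind's inference example configures max_depth = 10
        transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(224),
        transforms.Normalize(OPENAI_DATASET_MEAN, OPENAI_DATASET_STD),
    ]
)
depth_tensor = depth_transform(depth_mm)  # (3, 224, 224) float tensor for the depth encoder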

StanLei52 commented 4 months ago

Got it, thanks @jbrownkramer! I will look into this.