NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

CO3D dataset #25

Open ghost opened 2 years ago

ghost commented 2 years ago

Hi. Thanks for sharing the code. Great work!

I am trying to train the model with objects from the CO3D dataset (link). However, I am getting pretty bad results, and I suspect that my camera matrices are incorrect. I would really appreciate some help if possible.

Here is how I create the matrices (I used this article as a reference to create the projection matrix):


import math
import torch
from render import util  # nvdiffrec utility module (provides rotate_y)


def get_camera(self, annotations):
    image_size = annotations["image"]["size"]  # (height, width)
    viewpoint_frame = annotations["viewpoint_frame"]
    R = viewpoint_frame['R']
    T = viewpoint_frame['T']
    focal_length = viewpoint_frame['focal_length']
    principal_point = torch.tensor(viewpoint_frame['principal_point'])

    # Modelview matrix from the CO3D extrinsics
    mv = torch.tensor([
        [R[0][0], R[0][1], R[0][2], T[0]],
        [R[1][0], R[1][1], R[1][2], T[1]],
        [R[2][0], R[2][1], R[2][2], T[2]],
        [      0,       0,       0,    1],
    ], dtype=torch.float32)

    # Transform mv from PyTorch3D to OpenGL coordinates
    rotate_y = util.rotate_y(math.pi)
    mv = rotate_y @ mv

    # Convert principal_point and focal_length from NDC space to pixel space
    half_image_size = torch.tensor(list(reversed(image_size))) / 2.0  # (w/2, h/2)
    principal_point_px = -1.0 * (principal_point - 1) * half_image_size
    focal_length_px = torch.tensor(focal_length) * half_image_size

    # Projection matrix entries from the pixel-space intrinsics
    A = 2 * focal_length_px[0] / image_size[1]                       # 2*fx/w
    B = 2 * focal_length_px[1] / image_size[0]                       # 2*fy/h
    C = (image_size[1] - 2 * principal_point_px[0]) / image_size[1]  # (w - 2*cx)/w
    D = (image_size[0] - 2 * principal_point_px[1]) / image_size[0]  # (h - 2*cy)/h
    n = 0.1     # near plane
    f = 1000.0  # far plane

    proj = torch.tensor([
        [A,   0, -C,                  0],
        [0,  -B,  D,                  0],
        [0,   0, (-f - n) / (f - n), -2.0 * f * n / (f - n)],
        [0,   0, -1,                  0],
    ])

    campos = torch.linalg.inv(mv)[:3, 3]  # camera position in world space
    mvp = proj @ mv

    return mv, mvp, campos

Result:

(Image: img_dmtet_pass1_000010)
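
For debugging, here is the helper I use to check the matrices (my own sketch, not part of nvdiffrec): I take a world-space point that I can identify in a frame, for example one from the CO3D point cloud, push it through mvp, and compare it against the pixel where it actually appears.

import torch

def project(mvp, point_world, image_size):
    # Project a world-space point through mvp and return pixel coordinates
    h, w = image_size
    p = mvp @ torch.tensor([*point_world, 1.0], dtype=torch.float32)
    ndc = p[:3] / p[3]                        # perspective divide -> NDC
    x_px = (ndc[0] * 0.5 + 0.5) * w           # OpenGL NDC -> pixel space
    y_px = (1.0 - (ndc[1] * 0.5 + 0.5)) * h   # flip y so the origin is top-left
    return float(x_px), float(y_px)

If the matrices were correct, the reprojected points should line up with the image content.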

jmunkberg commented 2 years ago

Thanks @evenfh,

We haven't tested with CO3D (yet), so I cannot quickly answer your question. Is the format of CO3D similar to the DTU MVS format?

We did write a loader for DTU (unfortunately not available in the public release). For that I used the camera model described at https://docs.opencv.org/3.4/d9/d0c/group__calib3d.html (see the Detailed Description) together with the decomposeProjectionMatrix function (https://docs.opencv.org/3.4/d9/d0c/group__calib3d.html#gaaae5a7899faa1ffdf268cd9088940248), and built an OpenGL projection matrix from the result, following slide 25 of https://fileadmin.cs.lth.se/cs/Education/EDA221/lectures/latest/Lecture5_web.pdf
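
Roughly, the recipe looks like this (a sketch only, not our actual DTU loader; opengl_from_opencv is a hypothetical helper, the cx/cy signs depend on the image-origin convention, and an additional axis flip between the OpenCV and OpenGL camera frames may still be needed):

import cv2
import numpy as np

def opengl_from_opencv(P, w, h, n=0.1, f=1000.0):
    # Decompose the 3x4 OpenCV projection P = K [R | t]
    K, R, t = cv2.decomposeProjectionMatrix(P)[:3]
    K = K / K[2, 2]                  # normalize so K[2, 2] == 1
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # decomposeProjectionMatrix returns the camera center in homogeneous
    # coordinates, so rebuild the world-to-camera (modelview) matrix
    C = (t[:3] / t[3]).reshape(3)
    mv = np.eye(4)
    mv[:3, :3] = R
    mv[:3, 3] = -R @ C

    # OpenGL projection from the pixel-space intrinsics (one common
    # convention, cf. slide 25 of the lecture linked above)
    proj = np.array([
        [2*fx/w, 0,       1 - 2*cx/w,    0],
        [0,      2*fy/h,  2*cy/h - 1,    0],
        [0,      0,      -(f+n)/(f-n),  -2*f*n/(f-n)],
        [0,      0,      -1,             0],
    ])
    return proj @ mv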

JHnvidia commented 2 years ago

Please note that we do not expect good quality results from CO3D, as it breaks many of our assumptions. Compared to the NeRF/NeRD datasets it has less controlled lighting, lower-quality segmentation masks, and probably more inaccurate camera positions. For some more info, see: https://github.com/NVlabs/nvdiffrec/issues/10

ghost commented 2 years ago

Thanks for the reply @jmunkberg and @JHnvidia,

I am not familiar with the DTU MVS format, unfortunately. The CO3D dataset provides the focal length and principal point in NDC space. It also provides extrinsic camera parameters (rotation and translation), which I believe use the PyTorch3D coordinate-system convention. In PyTorch3D, +Z points into the screen, whereas OpenGL has +Z pointing out of the screen, as explained here: https://pytorch3d.org/docs/renderer_getting_started (see "Pytorch3D vs OpenGL").
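
To make that explicit: util.rotate_y(math.pi) is just a sign flip on the x and z axes, so the conversion in my code above is equivalent to this (a sketch, where mv is the 4x4 world-to-camera matrix built from R and T):

import torch

# PyTorch3D camera: +X left,  +Y up, +Z into the screen
# OpenGL camera:    +X right, +Y up, +Z out of the screen
flip = torch.diag(torch.tensor([-1.0, 1.0, -1.0, 1.0]))
mv_opengl = flip @ mv  # same as util.rotate_y(math.pi) @ mv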

I read your comment about the segmentation masks, @JHnvidia. I created new segmentation masks by hand, so that particular aspect should not be a problem. However, I agree that the inaccurate camera positions and the lighting might be part of the issue here.

I updated the code above with some comments.