chenyilun95 / DSGN2

DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors (T-PAMI 2022)
Apache License 2.0

Some questions about 3DGV, PSV and Front-Surface Depth Head #7

Closed SibylGao closed 1 year ago

SibylGao commented 1 year ago

Hi! Thanks for sharing your awesome work, but I am quite confused about the coordinate system in your code. First, depth-wise cost volumes are built in the PSV:

cost_raw = self.build_cost(left_stereo_feat, right_stereo_feat,
                None, None, downsampled_disp, psv_disps_channels.to(torch.int32))
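
For reference, here is a minimal sketch of how a disparity-wise (plane-sweep) cost volume is commonly built, shifting the right feature map per disparity candidate and concatenating it with the left features; the function name and concat layout below are my assumptions for illustration, not the repo's actual build_cost:

import torch

def build_cost_sketch(left_feat, right_feat, disp_candidates):
    # left_feat, right_feat: [B, C, H, W] stereo features.
    # disp_candidates: list of non-negative int disparities.
    # Hypothetical concat-style volume (PSMNet-like), not the repo's implementation.
    B, C, H, W = left_feat.shape
    cost = left_feat.new_zeros(B, 2 * C, len(disp_candidates), H, W)
    for i, d in enumerate(disp_candidates):
        if d > 0:
            cost[:, :C, i, :, d:] = left_feat[:, :, :, d:]
            cost[:, C:, i, :, d:] = right_feat[:, :, :, :-d]
        else:
            cost[:, :C, i] = left_feat
            cost[:, C:, i] = right_feat
    return cost  # [B, 2C, D, H, W], one slice per disparity candidate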

Then, a 3D mesh grid in pseudo-LiDAR coordinates is generated:

def prepare_coordinates_3d(self, point_cloud_range, voxel_size, grid_size, sample_rate=(1, 1, 1)):
    self.X_MIN, self.Y_MIN, self.Z_MIN = point_cloud_range[:3]
    self.X_MAX, self.Y_MAX, self.Z_MAX = point_cloud_range[3:]
    self.VOXEL_X_SIZE, self.VOXEL_Y_SIZE, self.VOXEL_Z_SIZE = voxel_size
    self.GRID_X_SIZE, self.GRID_Y_SIZE, self.GRID_Z_SIZE = grid_size.tolist()

    self.VOXEL_X_SIZE /= sample_rate[0]
    self.VOXEL_Y_SIZE /= sample_rate[1]
    self.VOXEL_Z_SIZE /= sample_rate[2]

    self.GRID_X_SIZE *= sample_rate[0]
    self.GRID_Y_SIZE *= sample_rate[1]
    self.GRID_Z_SIZE *= sample_rate[2]

    # Voxel-center positions along each axis, then a dense (z, y, x) mesh grid.
    zs = torch.linspace(self.Z_MIN + self.VOXEL_Z_SIZE / 2., self.Z_MAX - self.VOXEL_Z_SIZE / 2.,
                        self.GRID_Z_SIZE, dtype=torch.float32)
    ys = torch.linspace(self.Y_MIN + self.VOXEL_Y_SIZE / 2., self.Y_MAX - self.VOXEL_Y_SIZE / 2.,
                        self.GRID_Y_SIZE, dtype=torch.float32)
    xs = torch.linspace(self.X_MIN + self.VOXEL_X_SIZE / 2., self.X_MAX - self.VOXEL_X_SIZE / 2.,
                        self.GRID_X_SIZE, dtype=torch.float32)
    zs, ys, xs = torch.meshgrid(zs, ys, xs)
    coordinates_3d = torch.stack([xs, ys, zs], dim=-1)  # [Z, Y, X, 3] of (x, y, z)
    self.coordinates_3d = coordinates_3d.float()
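
As a usage illustration (these KITTI-like numbers are assumed, not necessarily the repo's config), the resulting grid is a dense lattice of voxel centers:

import torch

# Assumed KITTI-like range/voxel values for illustration only.
X_MIN, Y_MIN, Z_MIN, X_MAX, Y_MAX, Z_MAX = 2.0, -30.4, -3.0, 59.6, 30.4, 1.0
VX, VY, VZ = 0.2, 0.2, 0.2
GX, GY, GZ = 288, 304, 20  # (X_MAX - X_MIN) / VX, etc.

zs = torch.linspace(Z_MIN + VZ / 2, Z_MAX - VZ / 2, GZ)
ys = torch.linspace(Y_MIN + VY / 2, Y_MAX - VY / 2, GY)
xs = torch.linspace(X_MIN + VX / 2, X_MAX - VX / 2, GX)
zs, ys, xs = torch.meshgrid(zs, ys, xs, indexing="ij")
coordinates_3d = torch.stack([xs, ys, zs], dim=-1)
print(coordinates_3d.shape)  # torch.Size([20, 304, 288, 3]) -> [Z, Y, X, (x, y, z)]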

and the 3D world mesh grid is mapped to camera frustum space (I believe via the torch.cat operation?):

def compute_mapping(c3d, image_shape, calib_proj, depth_range, pose_transform=None):
    # Project the 3D points into the image plane, yielding pixel coordinates (u, v).
    coord_img = project_rect_to_image(
        c3d,
        calib_proj,
        pose_transform)
    # Append each point's depth (the z component), giving frustum coordinates (u, v, d).
    coord_img = torch.cat(
        [coord_img, c3d[..., 2:]], dim=-1)
    crop_x1, crop_x2 = 0, image_shape[1]
    crop_y1, crop_y2 = 0, image_shape[0]
    norm_coord_img = (coord_img - torch.as_tensor([crop_x1, crop_y1, depth_range[0]], device=coord_img.device)) / torch.as_tensor(
        [crop_x2 - 1 - crop_x1, crop_y2 - 1 - crop_y1, depth_range[1] - depth_range[0]], device=coord_img.device)
    # resize to [-1, 1]
    norm_coord_img = norm_coord_img * 2. - 1.
    return coord_img, norm_coord_img
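
Note project_rect_to_image is not shown above; here is a minimal sketch of what it presumably does, assuming a standard KITTI-style 3x4 rectified projection matrix (the body below is my assumption, not the repo's code):

import torch

def project_rect_to_image_sketch(pts_3d, P, pose_transform=None):
    # pts_3d: [..., 3] points in rectified camera coordinates; P: [3, 4] projection matrix.
    # Optional rigid transform applied first (e.g., for pose alignment).
    if pose_transform is not None:
        pts_3d = pts_3d @ pose_transform[:3, :3].T + pose_transform[:3, 3]
    ones = torch.ones_like(pts_3d[..., :1])
    pts_hom = torch.cat([pts_3d, ones], dim=-1)   # [..., 4] homogeneous coordinates
    uvw = pts_hom @ P.T                           # [..., 3] = (u*w, v*w, w)
    return uvw[..., :2] / uvw[..., 2:3]           # [..., 2] pixel coordinates (u, v)

The torch.cat in compute_mapping then appends each point's depth to its projected (u, v), giving the (u, v, d) frustum coordinates that are normalized to [-1, 1] for grid_sample.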

But I'm really confused by the following grid_sample operations, such as:

Voxel = F.grid_sample(out, norm_coord_imgs, align_corners=True)

and:

Voxel_2D = self.build_3d_geometry_volume(left_sem_feat, norm_coord_imgs, voxel_disps)

So this means that out (cost0) and left_sem_feat are both in image coordinates and are mapped to the normalized camera frustum space by filling the grid (for the cost volume, values are sampled from the volume, while for sem_feat, the grid is filled by replication along the depth axis)? After that, Voxel is grid-sampled again in the final depth estimation stage:

PSV_from_3dgv = F.grid_sample(Voxel, norm_coordinates_psv_to_3d)

It would be so helpful if you shared more details about the coordinate system and coordinate transformation in your code :confounded: Thank you so much.

chenyilun95 commented 1 year ago

Hi, thanks for your interest in the work.

Voxel = F.grid_sample(out, norm_coord_imgs, align_corners=True)

This transformation maps the PSV with coordinates (u, v, d) to a 3D voxel grid with coordinates (x, y, z).

Voxel_2D = self.build_3d_geometry_volume(left_sem_feat, norm_coord_imgs, voxel_disps)

This transformation maps the camera features (H×W×C) to a 3D voxel grid with coordinates (x, y, z).

PSV_from_3dgv = F.grid_sample(Voxel, norm_coordinates_psv_to_3d)

This transformation maps the 3D voxel grid with coordinates (x, y, z) back to frustum space (u, v, d) and computes the depth map loss (Front-Surface Depth Head).
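
To make the shapes concrete, here is a toy illustration of the first mapping (all sizes below are made up); F.grid_sample with a 5D input samples the (u, v, d) frustum volume at the projected voxel-center locations:

import torch
import torch.nn.functional as F

B, C, D, H, W = 1, 32, 48, 72, 240   # frustum volume (PSV) over (d, v, u)
Z, Y, X = 20, 304, 288               # voxel grid over (z, y, x)

psv = torch.randn(B, C, D, H, W)
# Per voxel, its projected (u, v, d) normalized to [-1, 1]; grid_sample expects the
# last dim ordered as (x=u, y=v, z=d) relative to psv's (W, H, D) axes.
norm_coord_imgs = torch.rand(B, Z, Y, X, 3) * 2 - 1

voxel = F.grid_sample(psv, norm_coord_imgs, align_corners=True)
print(voxel.shape)  # torch.Size([1, 32, 20, 304, 288]) -> [B, C, Z, Y, X]

The second mapping is the same idea with 2D features (each voxel takes the feature at its projected pixel, replicated along depth), and the third runs the sampling in the opposite direction, from (x, y, z) back to (u, v, d).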

SibylGao commented 1 year ago

3D voxel grid with coordinates (x, y, z).

Thanks for your reply! It really helps a lot. So the 3D voxel grid coordinates (x, y, z) actually mean (depth, width, height) in pseudo-LiDAR coordinates?

chenyilun95 commented 1 year ago

Yes, just note that the voxel grid has shape [Channel, Height (z), Width (y), Depth (x)], where the LiDAR coordinates are [x (forward), y (left), z (up)].
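
As a small illustration (sizes made up), converting that storage layout to (x, y, z) axis order is just a permute:

import torch

voxel = torch.randn(32, 20, 304, 288)   # [Channel, Height (z), Width (y), Depth (x)]
voxel_xyz = voxel.permute(0, 3, 2, 1)   # [Channel, x (forward), y (left), z (up)]
print(voxel_xyz.shape)                  # torch.Size([32, 288, 304, 20])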

SibylGao commented 1 year ago

By the way, have you tried training & evaluating with max_disp = 16? Surprisingly, I got results very close to training & evaluating with max_disp = 288. Might this be caused by overfitting?

chenyilun95 commented 1 year ago

Hi, I think it should not give similar results. Could you show the shape of your generated 3DGV feature with these modifications? Perhaps it is due to a bug like duplicated config definitions.