Sungmin-Woo / ProDepth

[ECCV 2024] ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion
https://sungmin-woo.github.io/prodepth/
MIT License

About reproducing the results #2

Open · Cresynia opened this issue 1 month ago

Cresynia commented 1 month ago

Thank you for your great work! I encountered some issues while trying to reproduce the results. I used 1 GPU with a batch size of 12 on the KITTI dataset, starting from the provided KITTI checkpoints with the teacher and pose networks frozen. However, there is a significant discrepancy in the results. Could you please let me know what mistakes I might be making and what I should pay attention to? Thank you very much! [screenshot: evaluation results]

Sungmin-Woo commented 1 month ago

Hi @Cresynia, sorry for the confusion. It appears that some hyperparameters, such as scaling factors, are missing from the current code for the KITTI dataset. We'll review the code and provide an update soon. Thanks for your patience!

Cresynia commented 1 month ago

Thanks so much!

757787182 commented 1 month ago

Thank you for your great work! I am very interested in the depth error maps shown in Figure 4. I would like to produce the same kind of diagram to inspect my own model's results. Could you please provide the relevant code?

Sungmin-Woo commented 1 month ago

Hi @757787182, thanks for your interest!

You can visualize the error maps using the following code in evaluate_depth.py. We use the improved ground truth for visualization and densify the sparse GT map with cv2 dilation.

# Assumes the surrounding context of evaluate_depth.py: gt_depths,
# mono_pred_disps, multi_pred_disps, opt, MIN_DEPTH, and MAX_DEPTH are
# already defined there, and numpy_intensitymap_to_pcolor is a colormap
# helper that renders an error map as an RGB image.
import cv2
import numpy as np
import tqdm

absrel_maps_mono = []
absrel_maps_multi = []
absrel_maps_mono_disp = []
absrel_maps_multi_disp = []

for i in tqdm.tqdm(range(len(gt_depths))):
    vis_gt_depth = gt_depths[i]
    vis_gt_height, vis_gt_width = vis_gt_depth.shape[:2]

    # Resize the predicted disparities to the GT resolution and invert
    # them to obtain depth.
    _pred_disp_z = np.squeeze(multi_pred_disps[i])
    _pred_disp_mono = np.squeeze(mono_pred_disps[i])
    vis_pred_disp_mono = cv2.resize(_pred_disp_mono, (vis_gt_width, vis_gt_height))
    vis_pred_disp_z = cv2.resize(_pred_disp_z, (vis_gt_width, vis_gt_height))
    vis_pred_depth_z = 1 / vis_pred_disp_z
    vis_pred_depth_mono = 1 / vis_pred_disp_mono

    if opt.eval_split == "eigen":
        # Standard Eigen evaluation: depth-range mask plus the Garg crop.
        mask = np.logical_and(vis_gt_depth > MIN_DEPTH, vis_gt_depth < MAX_DEPTH)
        crop = np.array([0.40810811 * vis_gt_height, 0.99189189 * vis_gt_height,
                         0.03594771 * vis_gt_width, 0.96405229 * vis_gt_width]).astype(np.int32)
        crop_mask = np.zeros(mask.shape)
        crop_mask[crop[0]:crop[1], crop[2]:crop[3]] = 1
        mask = np.logical_and(mask, crop_mask)
    else:
        mask = vis_gt_depth > 0

    pred_depth_z_mask = vis_pred_depth_z[mask]
    pred_depth_mono_mask = vis_pred_depth_mono[mask]
    gt_depth_mask = vis_gt_depth[mask]

    # Per-image median scaling; the ratios default to 1.0 so they are
    # defined even when median scaling is disabled.
    ratio1 = ratio2 = 1.0
    if not opt.disable_median_scaling:
        ratio1 = np.median(gt_depth_mask) / np.median(pred_depth_mono_mask)
        pred_depth_mono_mask *= ratio1
        ratio2 = np.median(gt_depth_mask) / np.median(pred_depth_z_mask)
        pred_depth_z_mask *= ratio2
    vis_pred_depth_mono *= ratio1
    vis_pred_depth_z *= ratio2

    # Clamp predictions to the evaluation depth range.
    vis_pred_depth_z = np.clip(vis_pred_depth_z, MIN_DEPTH, MAX_DEPTH)
    vis_pred_depth_mono = np.clip(vis_pred_depth_mono, MIN_DEPTH, MAX_DEPTH)

    # Dilate the sparse GT so the error maps are dense enough to read.
    kernel = np.ones((3, 3), np.uint8)
    vis_gt_depth = cv2.dilate(vis_gt_depth, kernel, iterations=1)
    vis_mask = vis_gt_depth > 0

    # Per-pixel AbsRel in depth space and in disparity (inverse-depth) space.
    absrel_map_multi = np.zeros(vis_gt_depth.shape)
    absrel_map_multi[vis_mask] = np.abs(vis_gt_depth[vis_mask] - vis_pred_depth_z[vis_mask]) / vis_gt_depth[vis_mask]
    absrel_map_multi_disp = np.zeros(vis_gt_depth.shape)
    absrel_map_multi_disp[vis_mask] = np.abs(1 / vis_gt_depth[vis_mask] - 1 / vis_pred_depth_z[vis_mask]) / (1 / vis_gt_depth[vis_mask])

    absrel_map_mono = np.zeros(vis_gt_depth.shape)
    absrel_map_mono[vis_mask] = np.abs(vis_gt_depth[vis_mask] - vis_pred_depth_mono[vis_mask]) / vis_gt_depth[vis_mask]
    absrel_map_mono_disp = np.zeros(vis_gt_depth.shape)
    absrel_map_mono_disp[vis_mask] = np.abs(1 / vis_gt_depth[vis_mask] - 1 / vis_pred_depth_mono[vis_mask]) / (1 / vis_gt_depth[vis_mask])

    # Colorize the error maps and crop to a common KITTI resolution.
    error_pil_multi = np.array(numpy_intensitymap_to_pcolor(absrel_map_multi[..., np.newaxis], vmin=0, vmax=0.5, colormap='jet', valid_mask=vis_mask))[:370, :1224, :]
    error_pil_mono = np.array(numpy_intensitymap_to_pcolor(absrel_map_mono[..., np.newaxis], vmin=0, vmax=0.5, colormap='jet', valid_mask=vis_mask))[:370, :1224, :]
    error_pil_multi_disp = np.array(numpy_intensitymap_to_pcolor(absrel_map_multi_disp[..., np.newaxis], vmin=0, vmax=0.5, colormap='jet', valid_mask=vis_mask))[:370, :1224, :]
    error_pil_mono_disp = np.array(numpy_intensitymap_to_pcolor(absrel_map_mono_disp[..., np.newaxis], vmin=0, vmax=0.5, colormap='jet', valid_mask=vis_mask))[:370, :1224, :]

    # Add a leading batch axis so the per-image maps can be concatenated.
    absrel_maps_mono.append(error_pil_mono[np.newaxis, :])
    absrel_maps_multi.append(error_pil_multi[np.newaxis, :])
    absrel_maps_mono_disp.append(error_pil_mono_disp[np.newaxis, :])
    absrel_maps_multi_disp.append(error_pil_multi_disp[np.newaxis, :])

absrel_maps_mono = np.concatenate(absrel_maps_mono)
absrel_maps_multi = np.concatenate(absrel_maps_multi)
absrel_maps_mono_disp = np.concatenate(absrel_maps_mono_disp)
absrel_maps_multi_disp = np.concatenate(absrel_maps_multi_disp)
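
Note that numpy_intensitymap_to_pcolor is a small colormap helper that is not shown above. A minimal stand-in, assuming it maps a single-channel error map to an RGB PIL image via a matplotlib colormap and blacks out pixels without ground truth, could look like this (a sketch, not the exact helper from the repo):

import matplotlib
import numpy as np
from PIL import Image

def numpy_intensitymap_to_pcolor(intensity_map, vmin=0.0, vmax=1.0,
                                 colormap='jet', valid_mask=None):
    # Squeeze the trailing channel axis and normalize to [0, 1].
    x = np.squeeze(intensity_map, axis=-1)
    x = np.clip((x - vmin) / (vmax - vmin), 0.0, 1.0)
    # Apply the colormap: returns (H, W, 4) RGBA floats in [0, 1].
    rgba = matplotlib.colormaps[colormap](x)
    rgb = (rgba[..., :3] * 255).astype(np.uint8)
    if valid_mask is not None:
        rgb[~valid_mask] = 0  # black out pixels with no GT depth
    return Image.fromarray(rgb)

Returning a PIL image matches the np.array(...) wrapping used in the loop above.
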
jerry-ryu commented 3 weeks ago

@Sungmin-Woo Thank you for your wonderful paper; I'm learning a lot from it. I'm wondering whether the KITTI training update you mentioned above has been applied to the repository. Thank you!

Below are the results from the model I trained myself and the results from the pretrained weights you provided. With no other changes, the model I trained with bs=12 on 2 GPUs does not reproduce the results reported in the paper.

[screenshot: self-trained vs. pretrained evaluation results]

757787182 commented 3 weeks ago

Thank you for your wonderful reply. I couldn't find the code that trains the single-frame variance. Could you help me with that?

Sungmin-Woo commented 3 weeks ago

Hello, @jerry-ryu!

We are currently busy with another project, so finalizing the code may take some time. We expect to release training code that reproduces the performance reported in the paper before the ECCV conference at the latest, so we kindly ask for your patience. We will leave a comment on this issue once the code is updated.

For now, we recommend conducting experiments on the Cityscapes dataset, where the performance gap is small.

Sungmin-Woo commented 3 weeks ago

@757787182

Thank you for your wonderful reply. I couldn't find the code that trains the single-frame variance. Could you help me with that?

The training code for the single-frame network is not provided, as it is nearly identical to the code provided for training the multi-frame network. You can predict the per-pixel variance of single-frame depth by simply modifying the reprojection loss as shown in Equation 4 of our paper.
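
As a rough illustration (a minimal sketch, not our exact implementation; please refer to Equation 4 for the precise form), an uncertainty-aware reprojection loss of this kind attenuates the per-pixel photometric error by a predicted scale and penalizes large uncertainty with a log term; the decoder simply gains one extra output channel for the log-variance. All names below are illustrative:

import torch

def uncertainty_reprojection_loss(reproj_error, log_sigma):
    # reproj_error: per-pixel photometric error, shape (B, 1, H, W)
    # log_sigma:    predicted log-scale from an extra decoder channel,
    #               same shape; predicting log(sigma) keeps sigma positive.
    sigma = torch.exp(log_sigma)
    # The error is down-weighted where the network predicts high
    # uncertainty; the log term prevents sigma from growing without bound.
    loss = reproj_error / sigma + log_sigma
    return loss.mean()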