BlarkLee / MonoPLFlowNet

ECCV 2022, MonoPLFlowNet

depth gt is used in evaluation phase? #2

Closed. shuxiusuxiu closed this issue 1 year ago

shuxiusuxiu commented 1 year ago

Thanks for your work! I ran into a question while reading the source code. In my view, the depth estimation, rather than the depth ground truth, should be the input to the scene flow module:

    feat0_1, feat0_2, feat0_4, feat1_1, feat1_2, feat1_4 = prepare_feat(all_feat_1, all_feat_2, generated_data, fu, fv, cx, cy)
    output = model(feat0_1, feat0_2, feat0_4, feat1_1, feat1_2, feat1_4, generated_data)

Here generated_data is built from pc1_gt and pc2_gt, so ground-truth information is used.
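For concreteness, here is a minimal, self-contained sketch (not the repository's code, just standard pinhole back-projection with the intrinsics fu, fv, cx, cy mentioned above, toy values assumed) of how a ground-truth depth map becomes a point cloud such as pc1_gt, which is why anything derived from pc1_gt/pc2_gt, including generated_data, implicitly carries ground-truth depth:

```python
import torch

def depth_to_point_cloud(depth, fu, fv, cx, cy):
    """Back-project an (H, W) depth map into an (H*W, 3) point cloud in the camera frame."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    z = depth
    x = (u - cx) * z / fu
    y = (v - cy) * z / fv
    return torch.stack((x, y, z), dim=-1).reshape(-1, 3)

# Toy usage: any point cloud built this way from GT depth inherits the GT metric scale.
depth_gt = torch.full((4, 4), 10.0)   # pretend GT depth map, 10 m everywhere
pc1_gt = depth_to_point_cloud(depth_gt, fu=721.5, fv=721.5, cx=2.0, cy=2.0)
print(pc1_gt.shape)                   # torch.Size([16, 3])
```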

BlarkLee commented 1 year ago

Thanks for your interest in our work. This is a good question. We use the ground-truth depth in evaluation because our evaluation aims to compare with previous LiDAR-based works. Until now, real-scale scene flow evaluation has only been done by LiDAR-based approaches, where the 3D scene flow ground truth is generated from the point clouds of two consecutive frames. To compare with them, we also have to evaluate against scene flow generated from LiDAR points, and the starting points should be aligned; otherwise the comparison with the LiDAR-based approaches makes no sense.

shuxiusuxiu commented 1 year ago

Thanks! But it seems that I didn't make myself clear.

The scene flow ground truth is generated from two consecutive frames. Specifically, in the FlyingThings3D dataset, we choose 8192 sample points (depth less than 35 m) per pair, and we obtain the scene flow ground truth by the simple subtraction below:

    sf_f = pc2[:, :3] - pc1[:, :3]
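A minimal sketch of this construction, under the assumptions stated here (depth taken as the z coordinate, a 35 m cut-off, up to 8192 random samples; variable names are illustrative, not the repository's preprocessing code):

```python
import torch

def scene_flow_gt(pc1, pc2, max_depth=35.0, n_points=8192):
    """pc1, pc2: (N, 3) corresponding GT point clouds; returns the sampled pair and GT flow."""
    mask = (pc1[:, 2] < max_depth) & (pc2[:, 2] < max_depth)   # keep points closer than 35 m
    idx = torch.nonzero(mask, as_tuple=False).squeeze(1)
    idx = idx[torch.randperm(idx.numel())[:min(n_points, idx.numel())]]
    pc1_s, pc2_s = pc1[idx], pc2[idx]
    sf_gt = pc2_s[:, :3] - pc1_s[:, :3]                        # the simple subtraction quoted above
    return pc1_s, pc2_s, sf_gt
```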

What we need to do is estimate the 3D flow of the corresponding 8192 points and use LiDAR-based metrics (e.g., EPE3D and outlier percentage) to evaluate the performance.
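A minimal sketch of what such LiDAR-based metrics typically look like; the 0.3 m / 10 % outlier thresholds are the ones commonly used in the scene flow literature, not necessarily the exact values in this repository:

```python
import torch

def scene_flow_metrics(sf_pred, sf_gt):
    """sf_pred, sf_gt: (N, 3) flow vectors in metres; returns (EPE3D, outlier ratio)."""
    err = torch.norm(sf_pred - sf_gt, dim=1)                 # per-point end-point error
    gt_norm = torch.norm(sf_gt, dim=1)
    epe3d = err.mean().item()
    outliers = ((err > 0.3) | (err / (gt_norm + 1e-8) > 0.1)).float().mean().item()
    return epe3d, outliers
```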

From my perspective, your work aims to recover real-scale scene flow using only consecutive RGB images as input, which means only RGB images should be needed in the inference phase.

In the evaluation phase, just as you explained, we need to compare against the scene flow ground truth generated from the ground-truth point clouds in order to measure performance. However, that doesn't mean we should feed ground-truth information to the model when evaluating it. We only need to find the corresponding 8192 sample points and then output the estimation for them.

To sum up, in your implementation the model needs generated_data built from the ground-truth point clouds, which brings in depth ground-truth information and also means the model cannot run when ground-truth point clouds are absent. Therefore, I think generated_data should instead be generated from the estimated point clouds, aligned with the ground truth by reusing the masks that are created when choosing the 8192 sample points:

    transformed_pc1_est = pc1_est[torch.squeeze(pc_mask, 0)][torch.squeeze(mask1, 0)]
    transformed_pc2_est = pc2_est[torch.squeeze(pc_mask, 0)][torch.squeeze(mask2, 0)]
    _, _, generated_data = gen_func([transformed_pc1_est, transformed_pc2_est])

By doing so, only RGB images are needed when using the model, and we can still use the LiDAR-based metrics, because the estimation and the ground truth are guaranteed to be aligned (they share the same masks).
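A tiny runnable illustration of the alignment argument, with toy tensors standing in for the real ones (pc_mask and mask1 here are only placeholders for the masks named above): applying the same masks to the GT and the estimated clouds keeps them index-aligned, so per-point metrics can be computed directly between them.

```python
import torch

torch.manual_seed(0)
pc1_gt = torch.rand(10, 3) * 40                       # toy GT point cloud
pc1_est = pc1_gt + 0.05 * torch.randn(10, 3)          # toy estimated cloud, same point order
pc_mask = pc1_gt[:, 2] < 35.0                         # e.g. the depth cut-off mask
mask1 = torch.arange(int(pc_mask.sum())) % 2 == 0     # e.g. the 8192-point sampling mask

gt_sel = pc1_gt[pc_mask][mask1]                       # points used to build the GT flow
est_sel = pc1_est[pc_mask][mask1]                     # same rows taken from the estimation
assert gt_sel.shape == est_sel.shape                  # row i refers to the same physical point
```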

BlarkLee commented 1 year ago

Thanks for your clear explanation! Now I understand your point, and I think what you propose makes sense. When I was doing this project last year, there were no previous works trying to use LiDAR-based metrics for RGB inputs and thus no standard evaluation pipeline, so we tried to reproduce exactly the same evaluation conditions as the LiDAR-based scene flow works. And yes, for inference the estimated depth should be used! I hope future works also adopt the evaluation standard you describe here. Thanks again for raising this point!