The main criterion is evaluation against the sparse ground truth. The key idea is that interpolated data is not real data, so comparing your prediction against it does not make much sense.
The interpolation can be used for qualitative results, where you can subjectively decide whether your prediction looks like the interpolated ground truth.
The problem with quantitative results on interpolated data lies at plane boundaries: between a pixel belonging to a foreground plane (say, a car) and the next one on the background, there is a depth discontinuity, but you don't know exactly where. Interpolation "blurs" the discontinuity, and some actually good points in your prediction may appear wrong because the interpolated value is a midpoint between foreground and background when it should not be.
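To make this concrete, here is a minimal 1-D sketch (plain NumPy, with made-up depth values) of how linear interpolation across a foreground/background boundary invents depths that belong to neither surface:

```python
import numpy as np

# Sparse ground truth along one image row: a car at ~5 m in front of
# a building at ~30 m. Only a few pixels carry LiDAR returns.
cols = np.array([10, 20, 30, 40])          # pixel columns with a measurement
depth = np.array([5.0, 5.0, 30.0, 30.0])   # metric depth at those columns

# Densify by linear interpolation, as some papers do for a "full" GT.
dense_cols = np.arange(10, 41)
dense_depth = np.interp(dense_cols, cols, depth)

# Between columns 20 and 30 the interpolated GT ramps smoothly from
# 5 m to 30 m, e.g. 17.5 m at column 25 -- a depth that exists nowhere
# in the scene. A prediction that correctly says 5 m (car) or 30 m
# (building) there would be penalized against this fake value.
print(dense_depth[dense_cols == 25])  # [17.5]
```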
Hope I was clear enough! For depth evaluation on KITTI, you can look at the first paper using the now-usual measurements: https://papers.nips.cc/paper/5539-depth-map-prediction-from-a-single-image-using-a-multi-scale-deep-network.pdf
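For reference, those measurements boil down to a few lines. A minimal NumPy sketch (function and variable names are mine), computed over the valid sparse GT points only:

```python
import numpy as np

def compute_depth_metrics(gt, pred):
    """Standard depth metrics from Eigen et al. (NIPS 2014).

    gt, pred: 1-D arrays of metric depths at the valid (sparse) GT pixels.
    """
    # Accuracy under threshold: fraction of pixels whose ratio to GT
    # is within 1.25, 1.25^2, 1.25^3.
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```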
Hi @ClementPinard, yes, your answer is very clear! Thank you very much! Another question: some papers use the sparse depth ground truth provided by the KITTI depth benchmark, while this paper uses a depth ground truth computed from other information and parameters. Will there be large differences between these two kinds of depth annotations? From the visualizations I can only see that the two depth ground truths appear to be very similar.
Officially, it is supposed to be exactly the same. The depth benchmark is just a ready-to-go depth image instead of LiDAR data + calibration.
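Concretely, the benchmark stores depth as 16-bit PNGs where the metric depth is the pixel value divided by 256, and 0 marks pixels without a measurement. A minimal loading sketch (the file path is hypothetical):

```python
import numpy as np
from PIL import Image

# KITTI depth maps are uint16 PNGs: depth [m] = value / 256, 0 = invalid.
depth_png = np.asarray(Image.open('groundtruth/depth_0000000005.png'),
                       dtype=np.uint16)
assert depth_png.max() > 255, "likely loaded an 8-bit image by mistake"

depth = depth_png.astype(np.float32) / 256.0
valid = depth_png > 0          # mask of pixels that actually have GT
print(depth[valid].min(), depth[valid].max())
```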
Now, if you look at other datasets, you can see slight differences, especially with Odometry, where the ground-truth pose has probably been smoothed compared to raw data + calibration.
I think it's safe to say that the evaluation is pretty much the same here, because the LiDAR and fixed calibration are pretty reliable.
I do not suggest validating against interpolated points (I haven't seen anyone doing that), but you can use interpolated depth to train your network, and my experiments show there is a boost from it if your interpolation is good and free of weird artifacts.
@ClementPinard I have evaluated both "LiDAR data + calibration" and the post-processed KITTI depth data on the Eigen split. In theory, as you said, they should be exactly the same, but quite different noises and artifacts affect the LiDAR measurements: raw LiDAR and the post-processed depth are not the same, and raw LiDAR is not a reliable measurement. Depth estimation research on KITTI is now at the point where using raw LiDAR for evaluation should be revisited.
I have also shown that if you use the ground truth from the new KITTI benchmark for training, you get a huge performance boost (compare rows one and three).
More info was discussed here:
@a-jahani Please correct me if I am wrong. There are two ways to obtain GT depth for the KITTI test setup (Eigen split or KITTI split, it doesn't matter): 1) calibration + Velodyne data (LiDAR), or 2) the official ground-truth depth images (uninterpolated) provided by the official KITTI providers.
People so far (including this work) used to follow 1), but the idea is to slowly shift towards 2)? If I am not wrong, there is an interpolated version (completed depth) of the GT depth images from the official KITTI providers; was it for qualitative comparisons only? Should the sparse (uninterpolated) GT depth (obtained from either 1) or 2)) alone be used for quantitative evaluation, the way everyone in this field reports it?
@koutilya40192 Yes, you are right on all counts. Evaluating on 1) is not good, as that ground truth is noisy; 2) is better, but it is still not dense, so your algorithm might predict very wrong results in the uncovered regions while still getting good numbers.
There is no interpolated version (completed depth) of the GT depth images from the official KITTI providers. Some researchers interpolate it themselves using different methods; some use the interpolated version for visualization only, some use it for training, and none (as far as I know) use it for quantitative evaluation. For quantitative evaluation it's either 1) or 2), and I suggest using 2) and submitting your results to the benchmark.
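For completeness, option 1) amounts to projecting the Velodyne scan into the rectified camera using the calibration matrices. A rough sketch, assuming the usual KITTI raw-data conventions and that the calibration matrices have already been parsed into NumPy arrays:

```python
import numpy as np

def velo_to_depth_map(scan, Tr_velo_to_cam, R_rect, P_rect, h, w):
    """Project a Velodyne scan (N, 4) into a sparse depth map (h, w).

    Tr_velo_to_cam: (4, 4) velodyne->camera transform
    R_rect:         (4, 4) rectifying rotation, padded to homogeneous
    P_rect:         (3, 4) projection matrix of the target camera
    """
    pts = scan[scan[:, 0] > 0].copy()   # keep points in front of the car
    pts[:, 3] = 1.0                     # homogeneous coordinates

    cam = (P_rect @ R_rect @ Tr_velo_to_cam @ pts.T).T   # (N, 3)
    z = cam[:, 2]
    u = np.round(cam[:, 0] / z).astype(int)
    v = np.round(cam[:, 1] / z).astype(int)

    inside = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w), dtype=np.float32)
    # Last write wins on pixel collisions; a careful implementation
    # would keep the minimum depth per pixel instead.
    depth[v[inside], u[inside]] = z[inside]
    return depth
```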
Interpolated points are not good, especially because of the depth discontinuities between foreground and background, where the interpolated values will be very wrong. The only way to get a dense ground truth is to interpolate the 3D point cloud into a mesh and then project the mesh into the camera frame.
However, you then need a much denser 3D point cloud, captured from different points of view, because here we only have the POV of the car.
It's going to take some work to build a dataset with a truly dense depth ground truth for validating these algorithms.
As for applying 2), I'll see what we can do to provide a script that does exactly that, whether it's just a note in the README on where to get the data or a brand-new testing script.
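Such a script would essentially combine the loading and metric sketches above. A hypothetical outline (reusing compute_depth_metrics from the earlier sketch; the 1e-3/80 m depth caps are the commonly used ones):

```python
import numpy as np

def evaluate_frame(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compare a dense prediction with benchmark GT at valid pixels only."""
    valid = (gt > min_depth) & (gt < max_depth)   # ignore missing GT
    pred = np.clip(pred[valid], min_depth, max_depth)
    return compute_depth_metrics(gt[valid], pred)
```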
Thanks for your responses @a-jahani and @ClementPinard. That really clears up a lot of my doubts.
Hi, the depth predictions are validated against the sparse KITTI ground-truth depths here, but there are also papers validating against a full ground truth (filled in by interpolation). Will there be a large difference in the validation results between these two methods? Which one is the main criterion in monocular depth estimation nowadays?