NVlabs / neuralrgbd

Neural RGB→D Sensing: Per-pixel depth and its uncertainty estimation from a monocular RGB video

Some confusion about test and inference #10

Closed LCJHust closed 5 years ago

LCJHust commented 5 years ago

Hi, after running inference with 'local_test.sh' on the KITTI dataset (split file: testing.txt) and generating depth maps, I evaluated them against the ground truth 'dmap_imgsize' but could not reproduce the metrics reported in the paper. For example, D1 in the paper is 93.15, but I only got 77.76 with your released model "kvnet_kitti.tar". So I want to ask how the depth maps should be evaluated, and which ground truth you used: 'dmap', 'dmap_raw', 'dmap_raw_bilinear_dw', 'dmap_imgsize', or something else? Could you release your evaluation code? Thank you! Hoping for your reply!

cxlcl commented 5 years ago

It is weird that the score is so low. For inference with local_test.sh, did you change the d_max parameter from 5 to 60, and set d_min to 1?

LCJHust commented 5 years ago

Thank you for your reply. When I ran inference on the KITTI dataset ('testing.txt'), I set d_max=60, d_min=0.001, sigma_soft_max=500; I also ran lines 233~235 instead of line 236 in test_KVNet.py. Another strange thing is that the depths generated with these settings range within (1, 3), which seems totally wrong. So I rescaled them to (0.001, 60) and then evaluated them against the ground truth ('dmap_imgsize'). Is there anything wrong? Thank you.

cxlcl commented 5 years ago

The output depth range should be [0.001, 60] given your parameters. Did you try d_min=1, d_max=60, sigma_soft_max=10, as suggested in the bash script? The model was trained with sigma_soft_max=10.
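
For concreteness, here is a minimal sketch of how d_min, d_max, and sigma_soft_max plausibly interact when regressing depth from a per-pixel probability volume via a soft argmax. The function name, shapes, and bin count are assumptions for illustration, not the repo's exact API:

```python
import torch

def soft_argmax_depth(cost_volume, d_min=1.0, d_max=60.0, n_bins=64,
                      sigma_soft_max=10.0):
    """Hypothetical sketch: regress per-pixel depth from a cost volume
    of shape (B, n_bins, H, W) via a temperature-scaled soft argmax."""
    # Depth hypotheses spanning [d_min, d_max]
    d_candidates = torch.linspace(d_min, d_max, n_bins)          # (n_bins,)
    # sigma_soft_max acts like a softmax temperature: it must match the
    # value used in training, or the distribution over depth bins is
    # mis-scaled and the expected depth drifts.
    prob = torch.softmax(sigma_soft_max * cost_volume, dim=1)    # (B, n_bins, H, W)
    depth = (prob * d_candidates.view(1, -1, 1, 1)).sum(dim=1)   # (B, H, W)
    return depth
```

Under this reading, running with a sigma_soft_max far from the training value (e.g. 500 instead of 10) distorts the distribution over bins, which could explain the compressed depth range reported above.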

LCJHust commented 5 years ago

Thank you. After I changed the parameters in the bash script (as suggested: d_min=1, d_max=60, sigma_soft_max=10), the results look much better. When I test on the split file 'testing.txt', D1 on the first trajectory '2011_09_26_drive_0002_sync' reaches 0.932, but some other trajectories are still weird: on the second trajectory '2011_09_26_drive_0005_sync', the 4th '2011_09_26_drive_0020_sync', and the 6th, 7th, 8th, and 9th trajectories, the D1 scores are about 0.7~0.8, even 0.630, so the average D1 score on the testing split is 0.712. (I didn't rescale the depth maps.) Is there still some mistake I made?

cxlcl commented 5 years ago

I've re-run the testing script using the given trained model, but I didn't get any D1 score in the range 0.7~0.8. Could you post your testing bash script (and upload your test_KVNet.py somewhere, if possible, so that I can run under your settings)?

LCJHust commented 5 years ago

Thank you very much! I have uploaded the partial code there. When I ran inference, I changed only these files. I tried loading images at sizes (768, 256), (386, 256), and (1252, 375); I got better results with (768, 256), but still far from the scores reported in the paper. Thank you!

cxlcl commented 5 years ago

According to the code, it turns out that the input images are resized to (768, 356) here, rather than (768, 256). The model was trained with image size (768, 256).
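
For illustration, a hedged preprocessing sketch showing how such a mix-up can happen: OpenCV's resize takes a (width, height) tuple, so a wrong value passes silently. The function and constant names here are hypothetical:

```python
import cv2

# The model expects 768x256 (width x height) inputs. cv2.resize takes
# dsize as (width, height), so a typo such as (768, 356) silently
# produces the wrong height without raising any error.
TARGET_W, TARGET_H = 768, 256

def load_and_resize(path):
    img = cv2.imread(path)                            # (H, W, 3), BGR
    img = cv2.resize(img, (TARGET_W, TARGET_H),
                     interpolation=cv2.INTER_LINEAR)  # (256, 768, 3)
    return img
```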

Also, in the metric evaluation, for all compared methods we average the metrics over all images rather than over videos. In other words, D1 is computed by averaging directly over all images across all videos.
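
For reference, a minimal sketch of this image-level averaging, assuming the standard delta < 1.25 definition of D1 (the exact evaluation code was not released in this thread):

```python
import numpy as np

def d1_accuracy(pred, gt, eps=1e-8):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25.
    This is the common delta_1 metric from KITTI-style depth
    evaluation; it is an assumption, not code from this repo."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / (gt[valid] + eps),
                       gt[valid] / (pred[valid] + eps))
    return np.mean(ratio < 1.25)

# Pool all frames from all videos, then average the per-image scores:
# scores = [d1_accuracy(p, g) for p, g in all_frames_from_all_videos]
# D1 = np.mean(scores)   # NOT a per-video mean of means
```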

In addition, since some of the compared methods are scale-ambiguous (e.g. MonoDepth outputs disparities rather than depths directly), we map their outputs to metric space (unit: meters) by scaling the output (or the inverse of the output, if it is a disparity) by the ratio between the GT depth and the output. The ratio is calculated once on the first frame of each trajectory, rather than per frame. To keep the comparison fair, we perform this rescaling for all methods, including ours. If the compared method is not scale-ambiguous, or only scale-invariant metrics are needed, this step can be skipped.
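
A hedged sketch of this alignment step; the median-based ratio and the function name are assumptions, since the comment only specifies "the ratio between the GT depth and the output":

```python
import numpy as np

def trajectory_scale(pred_first, gt_first):
    """One scale factor per trajectory, computed from its first frame.
    Using the median ratio is an assumption for robustness; the thread
    does not specify median vs. mean."""
    valid = gt_first > 0
    return np.median(gt_first[valid]) / np.median(pred_first[valid])

# Applied to every frame of that trajectory:
# scale = trajectory_scale(preds[0], gts[0])
# preds = [scale * p for p in preds]
# For a method that outputs disparity, invert first: depth = scale / disparity
```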

Lastly, for all methods we only compare depth estimation performance within the valid sensing range (up to 60 meters), since the model is trained with d_max = 60 and we truncate output values > 60 during inference.
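
A minimal sketch of the range handling, assuming a simple clip-and-mask rule (the exact masking logic is not shown in this thread):

```python
import numpy as np

D_MIN, D_MAX = 1.0, 60.0

def restrict_to_sensing_range(pred, gt):
    """Clip predictions to [D_MIN, D_MAX] and keep only pixels whose GT
    depth lies within the valid sensing range, before computing metrics."""
    pred = np.clip(pred, D_MIN, D_MAX)
    valid = (gt > 0) & (gt <= D_MAX)
    return pred[valid], gt[valid]
```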

LCJHust commented 5 years ago

Thank you! After I changed the averaging method in the evaluation, applied the per-frame ratio, and truncated the output with d_max=60, I got a D1 score of 0.917. I will next compute the ratio from the first frame of each trajectory and run inference again. (The input shape [768, 356] was probably a mistake I made while debugging.) I had a misunderstanding that supervised methods don't need to compute the ratio and rescale, which is why I made these mistakes and didn't get the correct results before. Thank you for your patience! Good luck!

cxlcl commented 5 years ago

You are welcome! BTW, the rescaling is just for a fair comparison with other supervised methods that have scale ambiguity. It's fine to remove this step.

apxlwl commented 4 years ago

> You are welcome! BTW, the rescaling is just for a fair comparison with other supervised methods that have scale ambiguity. It's fine to remove this step.

Hi, could you please update the testing code again?