JUGGHM / PENet_ICRA2021

ICRA 2021 "Towards Precise and Efficient Image Guided Depth Completion"
MIT License

Inference: own data #42

Open sbharadwajj opened 2 years ago

sbharadwajj commented 2 years ago

Hi,

I have a few questions about using our own data and running inference with the pretrained PENet weights.

1) How sparse can the input depth map be? My inference image is from the KITTI-360 dataset, which is quite similar to the original KITTI data the network was trained on, but there is no GT depth to sample from, so my sparse depth map is very sparse. When I run inference on this image, the prediction is also sparse, i.e., I only get predictions in the regions covered by the sparse depth map. Is this expected behaviour?

2) What should my input be for 'positions' (i.e., the cropped image)? I don't want to crop the images for inference, so should I just set input['positions'] = input['rgb']?

It would be great if you could answer these questions when time permits :)

Regards, Shrisha

JUGGHM commented 2 years ago

Thanks for your interest!

  1. The single-frame sparse depth maps used as input have a density of about 5% of all pixels, while the ground-truth depth maps have about 15%. The ground-truth data are generated by accumulating 11 sequential frames. For details of the KITTI Depth dataset, you could refer to [Sparsity Invariant CNNs] by Dr. Uhrig. If you do not want to generate ground-truth depth maps for KITTI-360 yourself, you could refer to [Self-supervised Sparse-to-Dense: Self-Supervised Depth Completion from LiDAR and Monocular Camera] by Dr. Fangchang Ma for self/un-supervised depth completion. (A quick way to check the density of your own maps is sketched after this list.)

  2. You do not need to change 'positions'; it holds positional encodings that serve as a prior for the network.
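
As a side note, here is a minimal sketch of how you could check the density of a sparse or ground-truth depth map (the file path is only an example, and the 256.0 scale factor follows the usual KITTI depth-completion PNG convention; adjust both for your own data):

```python
import numpy as np
from PIL import Image

def depth_density(png_path):
    """Fraction of pixels carrying a valid (non-zero) depth value."""
    # KITTI depth-completion PNGs store depth as uint16 in metres * 256; zero means "no measurement".
    depth = np.asarray(Image.open(png_path), dtype=np.float32) / 256.0
    return float((depth > 0).mean())

# A single-frame LiDAR map should come out around 5% dense,
# an accumulated ground-truth map around 15% (and often denser locally).
print(depth_density("proj_depth/velodyne_raw/image_02/0000000005.png"))
```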

Feel free to let me know if you have further questions.

sbharadwajj commented 2 years ago

Thank you for your quick reply.

  1. So the ground-truth depth maps with about 15% density are used as supervision? Thanks for the references, I will check them immediately.

Current setting: my sparse depth map has about 5% density, and I use the same map for supervising the network. So if I understood correctly, the ground-truth depth maps used for supervision need to be at least ~15% dense, correct?

Details of inference: model = PENet_C2; penet_accelerated = True; dilation rate = 2; convolutional-layer-encoding = 'xyz'; H = 192, W = 704 (I also changed the respective lines here). A sketch of the crop I mean is at the end of this comment.

Here is an example result; do you think this is the expected output?

  2. Ah, but when I run PENet_C2 for evaluation, what should input['positions'] be?
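
For reference, here is a minimal sketch of the crop I mean (my own preprocessing helper, not code from this repo; it assumes a bottom-center crop applied identically to the RGB image, the sparse depth map, and the position map):

```python
import numpy as np

def bottom_center_crop(arr, crop_h=192, crop_w=704):
    """Crop the bottom-center region, where LiDAR returns are concentrated in KITTI-style images."""
    h, w = arr.shape[:2]
    top = h - crop_h            # keep the bottom rows
    left = (w - crop_w) // 2    # center horizontally
    return arr[top:top + crop_h, left:left + crop_w]
```
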
JUGGHM commented 2 years ago

  1. One important feature of the GT depth maps is that in some local regions the annotation can be much denser than the average density of 15%. In general, the denser the GT depth maps are, the better the trained model predicts, so there is no hard lower bound on GT density.

Using the same sparse maps for supervision is not sufficient. To alleviate this, you need to generate denser GT depth maps (e.g., by accumulating neighbouring frames; a rough sketch is given at the end of this comment) or adopt un/self-supervised methods.

  2. The positional encodings do not need to be changed. If you do not want them, you could set --co to std when executing the training commands, but our pretrained models use the default settings.
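
As promised, a rough sketch of the accumulation idea (a simplified illustration, not the actual KITTI ground-truth pipeline, which additionally removes occluded and inconsistent points; camera poses and intrinsics are assumed to be available):

```python
import numpy as np

def accumulate_depth(depth_maps, poses, K, height, width):
    """Merge several sparse depth maps into a denser one in the reference frame.

    depth_maps: list of HxW arrays (0 = no measurement); poses: list of 4x4
    camera-to-world matrices (index 0 is the reference frame); K: 3x3 intrinsics.
    """
    accumulated = np.zeros((height, width), dtype=np.float32)
    world_to_ref = np.linalg.inv(poses[0])
    for depth, pose in zip(depth_maps, poses):
        v, u = np.nonzero(depth)                      # pixels with a valid measurement
        z = depth[v, u]
        # Back-project to 3D camera coordinates of that frame.
        x = (u - K[0, 2]) * z / K[0, 0]
        y = (v - K[1, 2]) * z / K[1, 1]
        pts = np.stack([x, y, z, np.ones_like(z)])    # 4xN homogeneous points
        # Transform into the reference camera and re-project.
        pts_ref = world_to_ref @ pose @ pts
        z_ref = pts_ref[2]
        u_ref = np.round(K[0, 0] * pts_ref[0] / z_ref + K[0, 2]).astype(int)
        v_ref = np.round(K[1, 1] * pts_ref[1] / z_ref + K[1, 2]).astype(int)
        ok = (z_ref > 0) & (u_ref >= 0) & (u_ref < width) & (v_ref >= 0) & (v_ref < height)
        # Keep the closest depth when several points land on the same pixel.
        for uu, vv, zz in zip(u_ref[ok], v_ref[ok], z_ref[ok]):
            if accumulated[vv, uu] == 0 or zz < accumulated[vv, uu]:
                accumulated[vv, uu] = zz
    return accumulated
```
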
sbharadwajj commented 2 years ago

Okay, I understand the point about sparsity for training.

But I am still unsure what the positional encodings are, because I am preparing my own data. Is there a flag for evaluation? I understand setting --co to std is for training, right?

I can see in model.py that the u and v coordinates of input['position'] are used, but I am still not sure what the positional encodings are or how to create them for my own data.

JUGGHM commented 2 years ago

You could refer to [An intriguing failing of convolutional neural networks and the CoordConv solution] by Liu for more details about positional encodings. In our default settings, we use the geometric encoding (i.e., 3D coordinates) described in our paper. The evaluation and training processes should share consistent settings. A sketch of how such encodings can be built is given below.
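
For illustration, a minimal sketch of the two kinds of encodings discussed here (a simplified re-implementation, not the repository's code, so the exact channels PENet expects may differ):

```python
import numpy as np

def uv_encoding(height, width):
    """CoordConv-style encoding: per-pixel u (column) and v (row) coordinate channels."""
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([u, v]).astype(np.float32)   # shape (2, H, W)

def geometric_encoding(sparse_depth, K):
    """Geometric (xyz) encoding: back-project each valid depth pixel into 3D camera coordinates."""
    h, w = sparse_depth.shape
    u, v = uv_encoding(h, w)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = sparse_depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z])                   # shape (3, H, W); zero where depth is missing
```

The intrinsics K here would come from your own camera calibration.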

sbharadwajj commented 2 years ago

I get it now; I was able to create the positional encoding (u, v coordinates using the camera intrinsics). I still get a patchy result like this when I evaluate. Can you tell what else might be going wrong, or which settings the model is sensitive to?

(I am just using the pretrained weights to evaluate on this data)

JUGGHM commented 2 years ago

Intuitively, I guess the GT maps are not dense enough for supervised depth completion. GT maps are required to be much denser than the sparse inputs.

sbharadwajj commented 2 years ago

But we don't need GT maps in this case, where we just evaluate. The result I've shared here is not based on fine-tuning but merely on test_completion.

JUGGHM commented 2 years ago

I think two points could be taken into consideration:

  1. The sparse depth maps in KITTI-360 seem denser than the ones we use in KITTI Depth. This means there is a domain gap between the two datasets, which can lead to poor predictions. We suggest that you either (i) construct denser GT maps for KITTI-360 and further train or finetune the model (this step is necessary for transfer learning), or (ii) consider depth completion methods with "sparsity invariance", which aim to counter the instability caused by unknown and varying input density. You could refer to [Sparsity Invariant CNNs] by Uhrig or [A Normalized Convolutional Neural Network for Guided Sparse Depth Upsampling] from our group; recently, this topic has also been discussed in [Boosting Monocular Depth Estimation with Lightweight 3D Point Fusion]. (A minimal sparsity-invariant convolution is sketched after this list.)

  2. A secondary reason: not all pixels in the predicted depth map are reliable. You could refer to a previous issue for this.
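
For reference, a minimal PyTorch sketch of the normalized / sparsity-invariant convolution idea from [Sparsity Invariant CNNs] (a simplified re-implementation for illustration, not the code of that paper or of PENet):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv(nn.Module):
    """Convolution that normalizes by the number of valid inputs under each kernel window."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.padding = kernel_size // 2
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=self.padding, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Fixed all-ones kernel used to count valid pixels inside each window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))

    def forward(self, x, mask):
        # x: (B, C, H, W) sparse input (zeros where invalid); mask: (B, 1, H, W), 1 = valid.
        features = self.conv(x * mask)
        valid_count = F.conv2d(mask, self.ones, padding=self.padding)
        features = features / valid_count.clamp(min=1e-5) + self.bias.view(1, -1, 1, 1)
        # Propagate the mask: a window is valid if it contains at least one valid pixel.
        new_mask = (valid_count > 0).float()
        return features, new_mask

# Usage sketch: depth is (B, 1, H, W); mask = (depth > 0).float()
# layer = SparseConv(1, 16); out, mask = layer(depth, mask)
```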