TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

image resizing #66

Closed · tonytu16 closed this issue 4 years ago

tonytu16 commented 4 years ago

Hello,

Thank you for your great work! I am able to train a model on my own dataset, and when I run inference on a 720 x 1280 frame, the output I get is a 768 x 1280 frame with the original RGB image resized to 384 x 1280 on the top and the inference result resized to 384 x 1280 on the bottom. I am wondering if there is a way to get the output image the same size as the input image?

Thank you!

VitorGuizilini-TRI commented 4 years ago

I'm glad you are finding our repository useful!

When running inference (and at training time) you can set the image_shape augmentation parameter to choose the dimensions of the image that will be used as input to the depth and pose networks. The output depth map will have the same resolution; right now there is no option to reshape it back to the original input image.

It should be easy to add, though, and I can try to include it in the next update. In the meantime, we have interpolate_image in packnet_sfm/utils/image.py that does exactly that.
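
For reference, here is a minimal sketch of that resize using plain torch.nn.functional.interpolate (interpolate_image is essentially a thin wrapper around it; the tensor shapes below are just examples matching your 720 x 1280 case):

```python
import torch
import torch.nn.functional as F

# Example shapes only: a depth map predicted at the network resolution,
# to be upsampled back to the original 720 x 1280 input frame.
depth = torch.rand(1, 1, 384, 1280)  # network output (B, 1, H, W)

depth_full = F.interpolate(depth, size=(720, 1280),
                           mode='bilinear', align_corners=True)
print(depth_full.shape)  # torch.Size([1, 1, 720, 1280])
```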

tonytu16 commented 4 years ago

Thank you for your response! So if I change the input image size in the config file to, say, 720 x 1280 (my original RGB size), then the depth image will also be 720 x 1280? Therefore the final output image when I run infer.py, which is the concatenation of the two, should be 1440 x 1280? Thanks!

VitorGuizilini-TRI commented 4 years ago

Yes, it should work like that. There are some limitations on the image shape due to the networks (each dimension has to be a multiple of 64, I think), and if you train at one resolution and evaluate at another, results will probably change drastically.
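
As a quick sanity check of that multiple-of-64 constraint, here is a small sketch (valid_shape is just a hypothetical helper, not something in the repository); note that 720, for instance, is not itself a multiple of 64:

```python
import math

def valid_shape(height, width, base=64):
    """Check the multiple-of-64 constraint and suggest the nearest
    larger shape that satisfies it."""
    ok = height % base == 0 and width % base == 0
    suggestion = (base * math.ceil(height / base),
                  base * math.ceil(width / base))
    return ok, suggestion

# 720 is not a multiple of 64, so 720 x 1280 may need adjusting:
print(valid_shape(720, 1280))   # (False, (768, 1280))
# 384 x 1280 satisfies the constraint as-is:
print(valid_shape(384, 1280))   # (True, (384, 1280))
```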

tonytu16 commented 4 years ago

Thanks for your answer! I noticed that in KITTI_train.yaml, the input resolution used is 192 x 640, but the KITTI dataset has resolution 384 x 1280. What is the reasoning behind this disparity? If my own dataset has resolution 1024 x 1024, is it recommended that I simply use 1024 x 1024 as the input size? Thank you!

VitorGuizilini-TRI commented 4 years ago

We use 192 x 640 (roughly half of the full resolution) for computational reasons, to speed up training and consume less memory. In our paper we have experiments at full resolution, and results are indeed much better (actually, PackNet is especially designed to excel at such higher resolutions).

So yeah, I agree: if you can fit a 1024 x 1024 image when training, and are comfortable with how long it takes to train, go for it!

tonytu16 commented 4 years ago

Thank you for your response! I tried training three different models with input sizes 384 x 1280, 1024 x 1024 (my training set resolution), and 768 x 1280. It looks like 384 x 1280 produces the best result. From my understanding, shouldn't the results get better as the input resolution increases? (P.S. I am using the KITTI dataloader.)

[three result images attached]

VitorGuizilini-TRI commented 4 years ago

Higher resolution should give better results, but it is harder to train. Our usual approach is to train at a lower resolution and then fine-tune at a higher one; can you try that? Another issue could be the camera: have the images been rectified and undistorted? We currently only support a pinhole camera model for self-supervised training.
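
If not, something like this OpenCV snippet is the usual starting point (a rough sketch; the intrinsics and distortion coefficients below are placeholders you would replace with values from your own calibration):

```python
import cv2
import numpy as np

# Placeholder calibration values; substitute the intrinsics and distortion
# coefficients obtained from calibrating your own camera.
K = np.array([[1000.0,    0.0, 512.0],
              [   0.0, 1000.0, 512.0],
              [   0.0,    0.0,   1.0]])
dist = np.array([-0.1, 0.01, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3

img = cv2.imread('frame.png')               # hypothetical input frame
undistorted = cv2.undistort(img, K, dist)   # remove lens distortion
cv2.imwrite('frame_undistorted.png', undistorted)
```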