alexklwong / calibrated-backprojection-network

PyTorch Implementation of Unsupervised Depth Completion with Calibrated Backprojection Layers (ORAL, ICCV 2021)

Kitti static frames #26

Closed DongyangHuLi closed 1 year ago

DongyangHuLi commented 1 year ago

Hey, Alex. I noticed that you used a kitti_static_frames.txt file when processing the dataset. Is the purpose of this to pick out pictures of scenes without moving objects?

alexklwong commented 1 year ago

Hi, the static frames are referring to images without camera motion, since one of the losses relies on image reconstruction, which follows structure from motion (SfM).
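For intuition, the reconstruction loss can be sketched roughly as below. This is a minimal PyTorch illustration, not the repo's actual implementation; the function name and tensor shapes (`depth` as B×1×H×W, `pose` as a B×4×4 rigid transform, `K` as a 3×3 intrinsics matrix) are assumptions:

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, source, depth, pose, K):
    """L1 photometric reconstruction loss: warp `source` into the target
    view using the predicted depth and the relative camera pose, then
    compare against `target`."""
    b, _, h, w = target.shape
    # Homogeneous pixel grid, shape (3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing='ij')
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1), torch.ones(h * w)])
    # Backproject pixels to 3-D camera coordinates, scaled by depth
    cam = (torch.linalg.inv(K) @ pix).unsqueeze(0) * depth.reshape(b, 1, -1)
    cam = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)   # (B, 4, H*W)
    # Rigid transform into the source frame, then project with intrinsics
    proj = K @ (pose @ cam)[:, :3, :]                        # (B, 3, H*W)
    uv = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)
    # Normalize coordinates to [-1, 1] for grid_sample
    grid = torch.stack([
        2.0 * uv[:, 0, :] / (w - 1) - 1.0,
        2.0 * uv[:, 1, :] / (h - 1) - 1.0], dim=-1).reshape(b, h, w, 2)
    warped = F.grid_sample(source, grid, mode='bilinear', align_corners=True)
    return torch.abs(warped - target).mean()
```

Note that with an identity pose (no camera motion) the warp maps every pixel to itself, so the loss is zero regardless of the predicted depth; that is exactly why static frames give no training signal.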

DongyangHuLi commented 1 year ago

I'm sorry, but I don't seem to follow you. Is the overall unsupervised framework different from unsupervised monocular depth estimation? The unsupervised monocular depth estimation method does not seem to require this. And could you tell me what is your purpose in preprocessing the images and stitching them together into one? Can I not do it this way?

alexklwong commented 1 year ago

Monocular depth also requires this. The "splits" are chosen so that there is always camera motion.

For example: https://github.com/nianticlabs/monodepth2/tree/b676244e5a1ca55564eb5d16ab521a48f823af31/splits/eigen_zhou

Preprocessing the images and concatenating them is mainly for speed. Loading 3 smaller images is slower, whereas loading 1 large image is faster because of random access.
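The concatenation idea can be sketched as below. This is a minimal NumPy illustration rather than the actual setup script, and the function names are made up:

```python
import numpy as np

def concatenate_triplet(prev_img, curr_img, next_img):
    # Stack the three frames side by side so a single file read
    # fetches the whole triplet at once
    return np.concatenate([prev_img, curr_img, next_img], axis=1)

def split_triplet(concat_img):
    # Recover the three frames from the wide image at load time
    w = concat_img.shape[1] // 3
    return concat_img[:, :w], concat_img[:, w:2 * w], concat_img[:, 2 * w:]
```

At training time, one read of the wide image followed by slicing replaces three separate reads.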

DongyangHuLi commented 1 year ago

Thank you for your prompt reply! Just as you said, "the 'splits' are chosen so that there is always camera motion". But in your earlier answer, you said "the static frames are referring to images without camera motion". I'm sorry to bother you, but I don't know what I'm misunderstanding.


alexklwong commented 1 year ago

Yes, the splits are images that have camera motion. The static frames are images without camera motion. The static frames are removed here:

https://github.com/alexklwong/calibrated-backprojection-network/blob/master/setup/setup_dataset_kitti.py#L350-L366

For completeness, static frames may contain moving objects. So, to answer the original question, they are not meant to pick out images with moving objects. They are meant to remove images that do not have camera motion, which is equivalent to choosing images with camera motion (such as the splits used in monocular depth).
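The filtering step can be sketched as below. This is a simplified stand-in for the linked section of setup_dataset_kitti.py, and the key format shown in the comment is an assumption:

```python
def remove_static_frames(image_paths, static_keys):
    # Keep only paths that do not match any static-frame key; a key is
    # assumed to be a substring identifying the drive and frame id,
    # e.g. '2011_09_26_drive_0009_sync/image_02/data/0000000381'
    return [path for path in image_paths
            if not any(key in path for key in static_keys)]
```

After this step, every remaining training image comes from a frame with camera motion.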

DongyangHuLi commented 1 year ago


Hi Alex, thank you very much for your patient answer. Maybe I wasn't clear enough, but I think I understand what you mean. Allow me to rephrase my question. In my opinion, unsupervised depth completion and unsupervised monocular depth estimation use the same framework, namely optimizing the model with a photometric consistency loss based on reprojection. Then why does monocular depth estimation require camera motion, while depth completion cannot have camera motion (static frames)? Can depth completion not directly use the training data of monocular depth estimation (ignoring the sparse depth input)?

alexklwong commented 1 year ago

Perhaps there is a misunderstanding. Both require motion. The kitti static frames are frames without motion. Removing them leaves only frames with motion for training.

DongyangHuLi commented 1 year ago


Oh, I get it. Thank you so much!

DongyangHuLi commented 1 year ago


@alexklwong Hi, Alex. I'm sorry to take a moment of your time. If I use your method but only want single-image input (high resolution) without the concatenation preprocessing, the immediate problem that comes to mind is that the first frame of a video sequence has no previous frame, and the last frame has no next frame. So how do you ensure that you can read three frames at a time without concatenation? To work this out I looked at the data loading code for monodepth2, and it seems to do a simple read for i in [-1, 0, 1]. Strangely, no error occurred. But when I used your preprocessing script (https://github.com/alexklwong/calibrated-backprojection-network/blob/440f3ded678fc11c86b0c1fd3f5914c0713c607e/setup/setup_dataset_kitti.py#L286-L287), I got an array-out-of-bounds error: [screenshot of the IndexError]. If it's convenient, can you give me some hints? Thank you very much.

alexklwong commented 1 year ago

Are you using my codebase for this? I could not find image2_path = sequence_image_paths[image0_path_idx + 1] in my setup script.

As for monodepth, I am not familiar with their code, but it seems that they are reading from a predefined list:

https://github.com/nianticlabs/monodepth2/blob/b676244e5a1ca55564eb5d16ab521a48f823af31/trainer.py#L114-L139

The code that I have provided in the setup is creating that predefined list.
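The boundary-frame issue from the question above can be handled when that list is built, for example as below. This is a minimal sketch, not the repo's actual setup code, and the function name is made up:

```python
def build_triplet_list(sequence_image_paths):
    # Skip the first and last frame of each sequence so that every
    # entry in the list has both a previous and a next frame
    return [(sequence_image_paths[i - 1],
             sequence_image_paths[i],
             sequence_image_paths[i + 1])
            for i in range(1, len(sequence_image_paths) - 1)]
```

Because indices run from 1 to len - 2, indexing with i - 1 and i + 1 can never go out of bounds; an off-by-one in these range limits is the usual cause of the IndexError described above.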

DongyangHuLi commented 1 year ago

Yes, I also want to use your code to create the lists, but I changed it a little bit. Actually, image2_path = sequence_image_paths[image0_path_idx + 1] and image2_path = image_paths[image0_path_idx + 1] are similar.

alexklwong commented 1 year ago

So I could not reproduce the error. What is the main difference between your code and mine?

Regardless, we will be pushing an update to the codebase's setup to support both supervised and unsupervised training. This is in an effort to make the code more accessible to those who are also working in semi-supervised learning.

DongyangHuLi commented 1 year ago


I fixed it. It was my fault. Thank you for your reply!