Brummi / MonoRec

Official implementation of the paper: MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera (CVPR 2021)
MIT License

Do not understand the difference between Dts and Dt #43

Closed dragon97ytd closed 2 years ago

dragon97ytd commented 2 years ago

Thanks for your outstanding work in this area; it can be applied in many places. However, there is one thing in the paper that I do not understand, quoted below:

"Specifically, for each image, we predict its depth maps Dt and DSt using the cost volumes formed by temporal stereo images C and static stereo images CS, respectively." (This is under Multi-stage Training --> Bootstrapping --> Mask Module; if you cannot find it, searching for the sentence in your PDF reader will locate it.)

Can you tell me what "static stereo images" means and where they are used in your code? Also, if convenient, could you send me the code for generating the moving object mask? My email is "tingdong.yu@cripac.ia.ac.cn", and you can send the code there.

Thanks again for your great contribution. Best wishes

hardik01shah commented 2 years ago

Hi, as I understand the paper, "static stereo images" refers to the image from the second lens of the stereo camera at the same timestamp. The paper mentions that stereo images are used during the bootstrapping stage of training. The KITTI dataset provides stereo images, i.e. at each timestamp two images are captured, one from the left camera and one from the right. During training, this left/right pair is therefore used to form a cost volume, in addition to the one formed from the temporal stereo images.

Regarding implementation in code, this line in the KITTI dataloader is where the static stereo images are used.

Brummi commented 2 years ago

Hi @dragon97ytd, thank you for your interest in our work! And thank you @hardik01shah for replying, you are right!

Dt is the depth map predicted from a cost volume that was created from the monocular camera sequence (i.e. same camera, different time steps). In the code, these frames are usually just "frames" in the data dict.

DSt is the depth map predicted from a cost volume that was created from the stereo frame only (i.e. different camera, same time step). In the code, this frame is the "stereo_frame" in the data dict.
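
To make the distinction concrete, here is a minimal sketch, not the repository's actual code. Only the dict keys "frames" and "stereo_frame" come from this thread; the helper `source_views`, its `mode` argument, and the toy sample are hypothetical, and strings stand in for image tensors.

```python
from typing import Any, Dict, List


def source_views(data: Dict[str, Any], mode: str) -> List[Any]:
    """Pick the views that would be matched against the reference image.

    Hypothetical helper: only the keys "frames" and "stereo_frame" come from
    the thread; everything else here is an illustrative assumption.
    """
    if mode == "temporal":
        # Same camera, different time steps -> cost volume C -> depth Dt
        return list(data["frames"])
    if mode == "static":
        # Other camera, same time step -> cost volume CS -> depth DSt
        return [data["stereo_frame"]]
    raise ValueError(f"unknown mode: {mode!r}")


# Toy sample: strings stand in for image tensors.
sample = {
    "frames": ["left_t-1", "left_t+1"],
    "stereo_frame": "right_t",
}

temporal_views = source_views(sample, "temporal")
static_views = source_views(sample, "static")
print(temporal_views)  # ['left_t-1', 'left_t+1']
print(static_views)    # ['right_t']
```

The only point of the sketch is that the two depth maps differ in which views feed the cost volume; how the cost volume and depth prediction are actually computed is not shown here.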

I also sent you the code via email.

Best, Felix