ToughStoneX / Self-Supervised-MVS

Pytorch codes for "Self-supervised Multi-view Stereo via Effective Co-Segmentation and Data-Augmentation"

homo_warping vs inverse_warping #22

Closed TWang1017 closed 1 year ago

TWang1017 commented 1 year ago

Hi, thanks a lot for the amazing work. I am learning MVS and would like to ask about the difference between homo_warping and inverse_warping in your code.

I understand that homo_warping warps the features from a point in one camera to another using the homography H_i(d) = d · K_i · T_i · T_1^{-1} · K_1^{-1}.

Could you please explain inverse_warping a bit and its function in your code?

If inverse_warping is just the reverse of the homography, as I understand it, in what situations do you use homo_warping versus inverse_warping?

Thanks in advance.

ToughStoneX commented 1 year ago

In fact, the core of homo_warping and inverse_warping is the same homography warping function. (1) To understand it intuitively: the homo_warping in MVSNet assumes D depth hypotheses/planes and warps the pixels of the source views to the reference view for each of the D depth values using the homography warping function. In this way, the feature maps of the source views are warped to the reference view, and the cost volume is built from the reference feature map and the warped feature maps. By analogy, you can replace these feature maps with the original images. What if, given certain depth values, you warp the source-view images to the reference view? This is what the inverse_warping function does. You could even replace the inverse_warping function with the homo_warping function by simply substituting the feature maps with images and the depth hypotheses with the predicted depth map. In fact, the provided inverse_warping function is based on the classic implementation used in self-supervised monocular depth estimation.
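To make the plane-sweep idea concrete, here is a minimal, self-contained PyTorch sketch of warping a source feature map (or image) into the reference view over D hypothesized depth planes. The function name, argument shapes, and conventions (world-to-camera extrinsics) are illustrative assumptions, not the repo's actual homo_warping:

```python
import torch
import torch.nn.functional as F

def plane_sweep_warp(src_feat, K_src, T_src, K_ref, T_ref, depth_values):
    """Warp a source-view feature map (or image) into the reference view for
    each of D hypothesized depth planes.
    src_feat:     [B, C, H, W]
    K_*:          [B, 3, 3] intrinsics; T_*: [B, 4, 4] world-to-camera extrinsics
    depth_values: [B, D] hypothesized depths
    returns:      [B, C, D, H, W] warped feature volume
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]
    device = src_feat.device

    # relative transform: reference camera -> source camera
    rel = torch.matmul(T_src, torch.inverse(T_ref))
    R, t = rel[:, :3, :3], rel[:, :3, 3:4]

    # homogeneous pixel grid of the reference view, [B, 3, H*W]
    y, x = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                          torch.arange(W, device=device, dtype=torch.float32),
                          indexing='ij')
    pix = torch.stack([x, y, torch.ones_like(x)]).view(1, 3, -1).expand(B, 3, H * W)

    # back-project each pixel at every hypothesized depth, move to the source camera
    rays = torch.matmul(torch.inverse(K_ref), pix)                   # [B, 3, H*W]
    pts = rays.unsqueeze(2) * depth_values.view(B, 1, D, 1)          # [B, 3, D, H*W]
    pts_src = torch.matmul(R, pts.reshape(B, 3, D * H * W)) + t      # [B, 3, D*H*W]

    # project into the source image and do the perspective divide
    p_src = torch.matmul(K_src, pts_src)
    xy = p_src[:, :2] / p_src[:, 2:3].clamp(min=1e-6)

    # normalize to [-1, 1] and bilinearly sample the source features
    gx = xy[:, 0] / ((W - 1) / 2) - 1
    gy = xy[:, 1] / ((H - 1) / 2) - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, mode='bilinear',
                           padding_mode='zeros', align_corners=True)
    return warped.view(B, C, D, H, W)
```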

(2) To understand it from the formulas, please check the following one: D_src(p_src) · p_src = K_src · T_src · T_ref^{-1} · K_ref^{-1} · D_ref(p_ref) · p_ref

The implementations of homo_warping and inverse_warping are both based on this formula because it is easy to implement with matrix multiplication and to accelerate with PyTorch's CUDA support. By normalizing D_src(p_src) · p_src (assumed to be [x, y, z]) as [x/z, y/z, z/z] = [x', y', 1], we can obtain p_src from the aforementioned formula. In summary, given the intrinsic matrices K_ref, K_src, the extrinsic matrices T_ref, T_src, the predicted depth D_ref, and a pixel coordinate p_ref in the reference view, we can calculate the corresponding pixel coordinate p_src in the source view. Then we can perform bilinear interpolation on the source view and map these pixels to the reference view based on the aforementioned correspondence: p_ref -- p_src.
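As an illustration, here is a minimal sketch of inverse warping built directly on that formula, using grid_sample for the bilinear interpolation. The name inverse_warp, the shapes, and the returned validity mask are assumptions for illustration, not the repo's exact inverse_warping:

```python
import torch
import torch.nn.functional as F

def inverse_warp(src_img, depth_ref, K_src, T_src, K_ref, T_ref):
    """Synthesize the reference image by sampling the source image at p_src, where
    D_src(p_src) * p_src = K_src T_src T_ref^{-1} K_ref^{-1} D_ref(p_ref) * p_ref.
    src_img:   [B, 3, H, W]; depth_ref: [B, 1, H, W] depth predicted for the ref view
    K_*: [B, 3, 3] intrinsics; T_*: [B, 4, 4] world-to-camera extrinsics
    returns:   (warped ref image [B, 3, H, W], validity mask [B, 1, H, W])
    """
    B, _, H, W = src_img.shape
    device = src_img.device

    # homogeneous pixel coordinates p_ref of the reference view, [B, 3, H*W]
    y, x = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                          torch.arange(W, device=device, dtype=torch.float32),
                          indexing='ij')
    p_ref = torch.stack([x, y, torch.ones_like(x)]).view(1, 3, -1).expand(B, 3, H * W)

    # lift to 3D with the predicted reference depth, move to the source camera, project
    cam_ref = torch.matmul(torch.inverse(K_ref), p_ref) * depth_ref.view(B, 1, H * W)
    cam_ref_h = torch.cat([cam_ref, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_src_h = torch.matmul(torch.matmul(T_src, torch.inverse(T_ref)), cam_ref_h)
    p_src = torch.matmul(K_src, cam_src_h[:, :3])        # this is D_src(p_src) * p_src
    xy = p_src[:, :2] / p_src[:, 2:3].clamp(min=1e-6)    # perspective divide -> float pixel coords

    # bilinear sampling of the source image at the float coordinates
    gx = xy[:, 0] / ((W - 1) / 2) - 1
    gy = xy[:, 1] / ((H - 1) / 2) - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, mode='bilinear',
                           padding_mode='zeros', align_corners=True)

    # pixels that project outside the source image or behind the camera are invalid
    valid = ((gx.abs() <= 1) & (gy.abs() <= 1) & (p_src[:, 2] > 0)).float().view(B, 1, H, W)
    return warped, valid
```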


TWang1017 commented 1 year ago

“Then we can perform bilinear interpolation on the source view and map these pixels to the reference view based on the aforementioned correspondence: p_ref -- p_src.”

Hi, thanks a lot for the explanation. Regarding the last statement, if I understand correctly: we first find the corresponding pixel coordinate p_src in the source view, which is continuous (a float). But we want integer coordinates on the image, so we use bilinear interpolation over the 4 neighbouring integer pixels in the src view. Then that sampled RGB information from the src view is warped into the ref view to synthesize a new ref image for the photometric loss calculation?

So the entire process includes: 1. finding the corresponding pixel coordinate p_src in the source view (in float); 2. resolving the float coordinate via bilinear interpolation on the src view and mapping the sampled RGB onto the ref view for the photometric loss calculation. Are these two steps combined what is considered inverse warping?

Correct me if I am wrong and thanks a lot for your kind explanation.

ToughStoneX commented 1 year ago

Yes, it is correct.

There is one more thing to note: the mask of the synthesized ref image is indispensable. Otherwise the self-supervised training may collapse.

The reason is that only the regions common to the different views are useful, and these can be calculated during inverse warping. You may find that the function _bilinear_sample returns 2 variables: output and mask. The mask here is used to filter out the invalid regions.
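For example, here is a hedged sketch of how such a mask would typically be applied to the photometric term. The variable names warped, mask, and ref_img are assumptions following the inverse-warping sketch above; this is not the repo's exact loss code:

```python
# Masked photometric (L1) loss: only pixels visible in both views contribute.
# `warped` is the source image warped to the reference view, `mask` the validity
# mask from the bilinear sampling step, `ref_img` the reference image.
# The mask ([B, 1, H, W]) broadcasts over the 3 RGB channels, hence the factor 3,
# and the small epsilon guards against division by an all-zero mask.
photo_loss = (mask * (warped - ref_img).abs()).sum() / (3.0 * mask.sum() + 1e-7)
```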


TWang1017 commented 1 year ago

Hi, thanks a lot for your swift response; the reminder helps a lot.

One more thing: I trained on the DTU dataset with augmentation and co-seg deactivated. The training loss looks like the figure below; the SSIM loss dominates the standard unsupervised loss under the default weights [12 × self.reconstr_loss (photo_loss) + 6 × self.ssim_loss + 0.05 × self.smooth_loss]. In this case, is it sensible to change the weights, e.g. reduce 6 × self.ssim_loss to 1 × self.ssim_loss, so that it is in a similar range to reconstr_loss?

Also, the training does not seem steady; it fluctuates a lot. Any clues as to why this happens? Thanks in advance for your help.

[image: training loss curves]
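For reference, a minimal sketch of the weighted sum described above, with the weights exposed as parameters so the rebalanced setting can be tried. The argument names and default weights follow the quoted snippet and are illustrative only, not taken from the repo's training code:

```python
def combined_loss(reconstr_loss, ssim_loss, smooth_loss,
                  w_photo=12.0, w_ssim=6.0, w_smooth=0.05):
    # Weighted sum of the three self-supervised terms as quoted above;
    # pass w_ssim=1.0 to try the rebalanced setting discussed here.
    return w_photo * reconstr_loss + w_ssim * ssim_loss + w_smooth * smooth_loss
```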

TWang1017 commented 1 year ago


Sorry to bother you again. I just want to confirm that the first step, estimating the projected p_src coordinates, and the second step, inversely warping the bilinearly interpolated pixel RGB information onto the ref view, both use the same equation you mentioned earlier. The first one concerns coordinates and the second one warps the RGB information; both employ the camera intrinsic and extrinsic matrices and the estimated depth. Is it correct that they both use the same principle and equation?

D_src(p_src) · p_src = K_src · T_src · T_ref^{-1} · K_ref^{-1} · D_ref(p_ref) · p_ref

I much appreciate your time, thanks.

ToughStoneX commented 1 year ago

Yes, you are correct.
