fabiotosi92 / NeRF-Supervised-Deep-Stereo

A novel paradigm for collecting and generating stereo training data using neural rendering
https://nerfstereo.github.io/
MIT License

Question about 'eraser_transform' augmentation #17

Closed husheng12345 closed 1 year ago

husheng12345 commented 1 year ago

Thank you very much for sharing your excellent work. We are working on implementing code for training stereo networks. According to your paper, the augmentation procedure described in RAFT-Stereo is used for training. We notice there is an augmentation function named eraser_transform in RAFT-Stereo, which erases random regions in the right image.

    def eraser_transform(self, img1, img2):
        ht, wd = img1.shape[:2]
        if np.random.rand() < self.eraser_aug_prob:
            mean_color = np.mean(img2.reshape(-1, 3), axis=0)
            for _ in range(np.random.randint(1, 3)):
                x0 = np.random.randint(0, wd)
                y0 = np.random.randint(0, ht)
                dx = np.random.randint(50, 100)
                dy = np.random.randint(50, 100)
                img2[y0:y0+dy, x0:x0+dx, :] = mean_color

        return img1, img2

We are not sure whether this function conflicts with the Triplet Photometric Loss in your paper, which backward-warps the right/left image. So our questions are: 1) Do you use eraser_transform when training RAFT-Stereo? 2) If so, is this function applied to all three images (img0, img1 and img2), or just some of them?

It would also be very helpful if you could share the full augmentation code. Thank you.

fabiotosi92 commented 1 year ago

Hi, thank you for your appreciation of our work!

Indeed, we did incorporate the eraser_transform augmentation function during training. However, it's important to highlight some key considerations regarding our augmentation approach.

Specifically, for each triplet (img0, img1, and img2), we generated an augmented triplet using the various (asymmetric or symmetric) data augmentation techniques in the RAFT-Stereo code, resulting in the augmented set (img0_aug, img1_aug, and img2_aug). From this augmented triplet, a stereo pair is selected as input to the deep stereo network, i.e. (img1_aug, img2_aug).

However, when computing the triplet photometric loss, we always adopt the original, unaugmented triplet (img0, img1, and img2). This step is crucial to ensure that the loss computation remains consistent and aligned with the unaltered images.
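
For readers implementing this, here is a minimal sketch of the split between network inputs and loss images. It is not the authors' code: `erase_random_patch` is a hypothetical copy-based variant of the `eraser_transform` quoted above (the quoted version writes into `img2` in place, which would also alter the image later used for the loss), and `build_sample` and `augment_fn` are illustrative names. It only covers photometric augmentations; how crops or rescaling are kept consistent with the loss images is not described in this thread.

```python
import numpy as np

def erase_random_patch(img, rng, prob=0.5, size_range=(50, 100)):
    """Return a copy of img with 1-2 random rectangles filled with its mean color.

    Hypothetical, copy-based variant of eraser_transform: the input array
    is never modified, so it can still be used for the photometric loss.
    """
    out = img.copy()
    ht, wd = out.shape[:2]
    if rng.random() < prob:
        mean_color = np.mean(out.reshape(-1, 3), axis=0)
        for _ in range(rng.integers(1, 3)):
            x0 = rng.integers(0, wd)
            y0 = rng.integers(0, ht)
            dx = rng.integers(size_range[0], size_range[1])
            dy = rng.integers(size_range[0], size_range[1])
            out[y0:y0 + dy, x0:x0 + dx, :] = mean_color
    return out

def build_sample(triplet, augment_fn):
    """Split a triplet into network inputs (augmented) and loss images (original).

    Assumption: the pair fed to the network is (img1_aug, img2_aug),
    following the example given in the comment above.
    """
    augmented = tuple(augment_fn(im) for im in triplet)
    network_input = (augmented[1], augmented[2])
    loss_images = triplet  # untouched originals for the triplet photometric loss
    return network_input, loss_images
```

A call might look like `build_sample((img0, img1, img2), lambda im: erase_random_patch(im, rng))` with `rng = np.random.default_rng()`. Whether the eraser is applied to every image or only to some of them is left to the caller here, since the thread does not pin that down.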

Our augmentation strategy also includes both horizontal and vertical flipping within the triplets. When applying horizontal flipping, it's important to flip each image horizontally within the triplet and subsequently swap the left and right images, while keeping the central image unchanged. This rule, however, does not apply to vertical flipping, which involves flipping each individual image vertically.
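
The flipping rule can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, images are taken to be NumPy arrays of shape (H, W, C), and "keeping the central image unchanged" is read as the central image keeping its middle position (it is still flipped like the others).

```python
import numpy as np

def hflip_triplet(left, center, right):
    """Mirror every image left-to-right, then swap the outer two views.

    After mirroring, the former right view plays the role of the left
    view and vice versa; the central view stays in the middle.
    """
    return (
        right[:, ::-1].copy(),
        center[:, ::-1].copy(),
        left[:, ::-1].copy(),
    )

def vflip_triplet(left, center, right):
    """Mirror every image top-to-bottom; the order of the views is kept."""
    return (
        left[::-1].copy(),
        center[::-1].copy(),
        right[::-1].copy(),
    )
```

Applying `hflip_triplet` twice returns the original triplet in its original order, which is a quick sanity check when wiring this into a data loader. If the loss is computed on unaugmented images as described above, the same flip presumably has to be applied to them as well so that inputs and loss images stay geometrically aligned; the thread does not spell this out.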

Furthermore, we incorporated the RandomVdisp function into our augmentation process. You can find it at the following link.

husheng12345 commented 1 year ago

Thank you so much for your reply. We will follow your instructions to implement training code.