facebookresearch / synsin

View synthesis for the public.
http://www.robots.ox.ac.uk/~ow/synsin.html

Trying to train SynSin on SceneNet database #24

Open NagabhushanSN95 opened 3 years ago

NagabhushanSN95 commented 3 years ago

Hi, I'm trying to train the SynSin model on the SceneNet database, but I'm not able to get the model to train. I would really appreciate it if you could give me some tips.

  1. I'm using only 2000 frame pairs. To be specific, I'm using frames 0, 25, 3750 and 3775 from each scene of the first part of the training set, which contains 1000 scenes. So I believe there is a considerable amount of diversity.
  2. Also, since SceneNet has ground-truth depths, I'm using them and bypassing the depth regressor network. For this, I've enabled the --use_gt_depth flag.
  3. In issue #23, it was suggested to use square images only. Since SceneNet has rectangular images (320x240), I'm cropping the frames and depth maps to 240x240. I've modified the camera intrinsic matrix (K) accordingly.
  4. Even after training for 100000 epochs, the model doesn't train at all, i.e. I only get some red/blue images, nothing else. I can understand if prediction fails on test images, but it is failing on the training images themselves. PSNR starts at -5, increases to 1 or 2 and then goes negative again. SSIM doesn't increase beyond 0.05. What do you think the problem could be here?
  5. I tried both learning rates, the one mentioned in the paper and the default in the code. Neither worked.
  6. I noticed that the l1 loss and the perceptual (content) loss are around 0.7 or 0.8, but the GAN loss is an order of magnitude higher (around 7). So I set the lambda values for the l1 loss and the perceptual loss to 10 (see the sketch after this list). That didn't help either.
  7. The GAN loss starts low (around 5) and keeps increasing for around 80000 iterations, up to about 12, and then it almost flattens.
  8. I would assume the reason for 7 is that the discriminator is training faster than the generator. But D_Real and D_Fake have similar values in each batch (around 0.1 to 0.3), so the discriminator isn't training well either.
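
To be explicit about the change in point 6, this is roughly the weighting I used (a sketch in my own words; the function and argument names are mine, not SynSin's actual loss code or option names):

def total_generator_loss(l1_loss, perceptual_loss, gan_loss,
                         lambda_l1=10.0, lambda_content=10.0, lambda_gan=1.0):
    # Re-weighted generator objective from point 6; names and default weights are mine.
    return lambda_l1 * l1_loss + lambda_content * perceptual_loss + lambda_gan * gan_loss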

I don't know what else to try. Can you kindly help me out here?

NagabhushanSN95 commented 3 years ago

This is the command I'm using to start the training:

python snb/train.py --batch-size 4 --folder temp --num_workers 4 --resume --dataset scenenet --use_inv_z --accumulation alphacomposite --model_type zbuffer_pts --refine_model_type resnet_256W8UpDown64  --norm_G sync:spectral_batch --render_ids 1 --suffix '' --normalize_image --lr 0.0001 --use_gt_depth --W 240 --log-dir ../Runs/Training/Train01/%s

I wrote a DataLoader for SceneNet based on the KittiDataLoader. The code is as follows:

import math
from pathlib import Path

import numpy
import skimage.io
import skimage.transform
import torch
import torch.utils.data as data

class SceneNetDataLoader(data.Dataset):

    def __init__(self, split_name, opts=None):
        super(SceneNetDataLoader, self).__init__()
        self.opt = opts
        self.dataroot = Path(opts.dataset_path) / split_name
        self.scenes = []
        for scene_num in sorted(self.dataroot.iterdir()):
            self.scenes.append((scene_num.stem, 0))
            self.scenes.append((scene_num.stem, 3750))

    @staticmethod
    def get_image(path: Path):
        image = skimage.io.imread(path.as_posix()).astype(numpy.float32) / 255 * 2 - 1
        image = image[:, 40:280]                # Crop (240,320,3) to (240,240,3)
        image_tr = torch.from_numpy(image).permute((2, 0, 1))
        return image_tr

    @staticmethod
    def get_depth(path: Path):
        depth = skimage.io.imread(path.as_posix()) * 0.001  # depth PNGs are in millimetres; convert to metres
        depth = depth[:, 40:280]                # Crop (240,320) to (240,240)
        depth = depth[None]
        depth = depth.astype(numpy.float32)
        return depth

    def get_transformation(self, scene_num, view_num: int):
        transformation_matrix_path = self.dataroot / scene_num / 'TransformationMatrix.txt'
        transformation_matrices = numpy.genfromtxt(transformation_matrix_path.as_posix(), delimiter=',')
        pose_index = view_num // 25             # one pose is stored per 25 frames
        pose1 = transformation_matrices[pose_index].reshape(4, 4)
        pose2 = transformation_matrices[pose_index + 1].reshape(4, 4)
        trans = numpy.matmul(pose2, numpy.linalg.inv(pose1)).astype(numpy.float32)  # relative transformation between the two views
        return trans

    @staticmethod
    def camera_intrinsic_transform(vfov=45, hfov=60, pixel_width=320, pixel_height=240):
        """
        Copied from SceneNet
        """
        camera_intrinsics = numpy.zeros((3, 4))
        camera_intrinsics[2, 2] = 1
        camera_intrinsics[0, 0] = (pixel_width / 2.0) / math.tan(math.radians(hfov / 2.0))
        camera_intrinsics[0, 2] = pixel_width / 2.0
        camera_intrinsics[1, 1] = (pixel_height / 2.0) / math.tan(math.radians(vfov / 2.0))
        camera_intrinsics[1, 2] = pixel_height / 2.0
        return camera_intrinsics

    def __getitem__(self, index):
        scene_id = self.scenes[index]
        scene_num, view_num = scene_id

        frame1_path = self.dataroot / scene_num / f'photo/{view_num:04}.jpg'
        frame2_path = self.dataroot / scene_num / f'photo/{view_num + 25:04}.jpg'
        frame1 = self.get_image(frame1_path)
        frame2 = self.get_image(frame2_path)

        frame1_depth_path = self.dataroot / scene_num / f'depth/{view_num:04}.png'
        frame2_depth_path = self.dataroot / scene_num / f'depth/{view_num + 25:04}.png'
        frame1_depth = self.get_depth(frame1_depth_path)
        frame2_depth = self.get_depth(frame2_depth_path)

        trans = self.get_transformation(scene_num, view_num)
        trans_inv = numpy.linalg.inv(trans)
        identity = torch.eye(4)
        intrinsic = self.camera_intrinsic_transform(pixel_height=frame1.shape[1], pixel_width=frame1.shape[2])
        K = numpy.eye(4, dtype=numpy.float32)
        K[:3, :4] = intrinsic
        K_inv = numpy.linalg.inv(K)

        return {'images': [frame1, frame2],
                'depths': [frame1_depth, frame2_depth],
                'cameras': [{'Pinv': identity, 'P': identity, 'K': K, 'Kinv': K_inv},
                            {'Pinv': trans_inv, 'P': trans, 'K': K, 'Kinv': K_inv}]
                }

    def __len__(self):
        return len(self.scenes)

    def toval(self, epoch):
        pass

    def totrain(self, epoch):
        pass
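
For completeness, this is roughly how I instantiate it for a quick sanity check (the SimpleNamespace and the dataset path below are just stand-ins; the real options come from SynSin's option parser):

from types import SimpleNamespace

opts = SimpleNamespace(dataset_path='/path/to/SceneNet')   # placeholder path
dataset = SceneNetDataLoader('train', opts)
sample = dataset[0]
print(sample['images'][0].shape)   # expected: torch.Size([3, 240, 240])
print(sample['depths'][0].shape)   # expected: (1, 240, 240)
print(sample['cameras'][1]['P'])   # relative pose from view 0 to view 25
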
oawiles commented 3 years ago

I think it's probably something with the camera setup -- when it first projects the points, you should see that the noisy results somewhat align with the true images. You can try using the true depths in the code to see if the cameras are right (here: https://github.com/facebookresearch/synsin/blob/master/models/z_buffermodel.py#L89).

NagabhushanSN95 commented 3 years ago

Thanks @oawiles. I'm already using the true depth. I'll check whether the warping of the features is correct.

oawiles commented 3 years ago

You can also try warping the RGB -- e.g. pass the RGB colours as features. This should be easier to check: the warped result should then precisely match the other image.

NagabhushanSN95 commented 3 years ago

@oawiles, you were right. The error is indeed in the warping: the output of the splatter is just an array of zeros. I believe the error is in the format of the transformation matrix, camera matrices and the depth map. Here are some of my findings:

  1. The method I've learnt for warping a frame to the view of its next frame indexes pixel locations as 0, 1, 2, ..., W-1, but your code remaps them to the [-1, 1] range. I think this may be causing the problem.
  2. I tried setting R = identity with a translation in the x direction only. The warped images from your code and mine matched when I set t_x = 0.001 for yours and t_x = 0.1 for mine. So there is some mismatch in ranges.
  3. So, can you please tell me how I should change my data (transformation matrix, camera matrix and depth values) so that it works with your code?
  4. Anyway, by replacing your warping code (z_buffer_manipulator.py/PtsManipulator/project_pts()) with mine (a simplified version is sketched after this list) and passing positive depth values to the splatter, the splattered image looks good. It looks a little blurred and objects seem a bit enlarged, which I believe is due to the splattering. Because of this, I'm not entirely sure my code is correct. Hence, can you please tell me what changes I have to make to my transformation and other data so that they are in the format expected by SynSin?
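
For reference, a simplified standalone version of the projection I swapped in (nearest-pixel scatter with no z-buffering, so not a drop-in replacement for your differentiable splatter; the depth/camera conventions match my data loader above):

import torch

def warp_rgb_with_depth(rgb, depth, K, Kinv, P):
    """Forward-warp frame1 into the view of frame2 using ground-truth depth.
    rgb:   (3, H, W) tensor in [-1, 1]
    depth: (1, H, W) tensor in metres
    K, Kinv, P: (4, 4) tensors; K in pixel units, P = pose2 @ inv(pose1).
    Nearest-pixel scatter, no z-buffering -- only for checking the cameras."""
    _, H, W = rgb.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    z = depth.reshape(-1)
    # Homogeneous pixel coordinates scaled by depth: [x*z, y*z, z, 1]
    pix = torch.stack([xs.reshape(-1) * z, ys.reshape(-1) * z, z, torch.ones_like(z)], dim=0)
    cam1 = Kinv @ pix    # back-project into the first camera's frame
    cam2 = P @ cam1      # move into the second camera's frame
    proj = K @ cam2      # project into the second image plane
    x2 = (proj[0] / proj[2]).round().long()
    y2 = (proj[1] / proj[2]).round().long()
    valid = (x2 >= 0) & (x2 < W) & (y2 >= 0) & (y2 < H) & (proj[2] > 0)
    warped = torch.full_like(rgb, -1.0)   # unfilled pixels stay at -1 (black after unnormalising)
    warped[:, y2[valid], x2[valid]] = rgb.reshape(3, -1)[:, valid]
    return warped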

Thanks a lot

NagabhushanSN95 commented 3 years ago

@oawiles, you were right. The error is indeed in the warping: the output of the splatter is just an array of zeros, and the error is in the format of the camera matrix. By writing my own transformation code, I'm able to train the SynSin model, but I'm not able to get your transformation (warping) code to work correctly. I had the camera matrix in the form

[image: camera intrinsic matrix]

With this camera matrix, the splatter output was all zeros. I changed the camera matrix and removed the dependence on the height and width of the frame, as follows

[image: modified camera intrinsic matrix]

With this, the splatter output is a warped frame, but the transformation doesn't match the ground truth. Can you suggest what changes I have to make to my camera matrix? In other words, in what format does your code expect the camera matrix to be?

Thanks a lot

oawiles commented 3 years ago

What is the error? Sometimes comparing how the splattered image looks against the true image makes the problem make sense. One thing I notice is that you should use K to map the values to [-1,1], which I believe is not what you're doing. Another thing: sometimes you have to flip the Y. Without being able to see the visual results, it's hard to guess at the precise problem.

NagabhushanSN95 commented 3 years ago

Hi, I've attached the images below. This is the first frame (true): [image: frame1]

This is the second frame (true): [image: frame2]

This is the first frame warped to the view of the second frame (splattered): [image: frame2_warped]

As you can notice, in the splattered image the green beam has come down compared to the true second frame.

My camera matrix is as below: [image: camera intrinsic matrix], where hfov=60 and vfov=45.

Also, I had to crop the images from 320x240 to 240x240. Would it make any difference?

oawiles commented 3 years ago

It could make a difference. I would recommend you first try resizing instead; otherwise I think the intrinsics would mess it up. It looks like it's zoomed in, which could be from the cropping. I'd recommend first resizing and then using a matrix to transform from the intrinsics to [-1,1] for x/y, using an offset matrix O such that you have a new intrinsic matrix I = O K, where K was your old intrinsic matrix.
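
For concreteness, something like this untested sketch (assuming your K is the 4x4 pixel-unit matrix from your loader and the image is H x W):

import numpy

def ndc_intrinsics(K, H, W, flip_y=False):
    """Offset matrix O mapping pixel x/y to [-1, 1], giving a new intrinsic I = O K.
    Untested sketch -- check the sign conventions against your data."""
    O = numpy.eye(4, dtype=numpy.float32)
    O[0, 0] = 2.0 / W      # x: [0, W) -> [-1, 1)
    O[0, 2] = -1.0
    O[1, 1] = 2.0 / H      # y: [0, H) -> [-1, 1)
    O[1, 2] = -1.0
    if flip_y:             # sometimes the y axis needs flipping
        O[1, 1] *= -1.0
        O[1, 2] *= -1.0
    return O @ K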

NagabhushanSN95 commented 3 years ago

OK. I'll try that. Thanks!

duyguceylan commented 2 years ago

Hi, I have similar issues to those described in the first message of this thread. I'm trying to train the code on my own dataset. I do save out the warped images using the GT depth with the 'use_rgb_features' option set to True, and they look good. However, the model doesn't really train, and I continue to get images that are mostly a single color. I tried debugging using only the L1 loss etc., but I observe the same pattern. Do you have any other pointers to what could be the issue?