ClementPinard / SfmLearner-Pytorch

Pytorch version of SfmLearner from Tinghui Zhou et al.
MIT License

difference of the predicted translation and ground truth vectors #155

Open kiriakospapa opened 7 months ago

kiriakospapa commented 7 months ago

Hello Clement,

First of all, I have to give you kudos for the amazing work you did in this repo.

Coming to the reason I opened this issue: I am trying to find the difference between the predicted translation vector and the ground truth translation vector.

Sadly, I can't manage to extract the predicted translation vector from the output of the pose network. I am aware of the ambiguity of the predicted translation vector. Any help in figuring this out would be really appreciated.

ClementPinard commented 7 months ago

Hi, not sure what you want exactly. If you want the trajectory from the pose vectors, you can see how it's done in test_pose: https://github.com/ClementPinard/SfmLearner-Pytorch/blob/master/test_pose.py#L78

Basically, everything is given with respect to the middle frame, so you need to put everything back in the reference of the first frame.

Once that's done, if you want the trajectory for a longer sequence than just 5 frames, you will need to compose the 4x4 matrices so that the very first frame is the reference (the identity matrix) and all the other matrices are given with respect to it.

Your translation vectors will then be the first 3 rows of the last column.
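
For illustration, a minimal sketch of that composition (illustrative names only; it assumes each relative pose is already a homogeneous 4x4 matrix taking frame i+1 coordinates into frame i coordinates, and the multiplication order flips if your matrices are defined the other way around):

    import numpy as np

    def compose_trajectory(relative_poses):
        # Chain relative 4x4 poses into absolute poses expressed in the
        # coordinate frame of the very first frame, which stays the identity.
        global_pose = np.eye(4)
        translations = [global_pose[:3, 3].copy()]
        for rel in relative_poses:
            global_pose = global_pose @ rel
            translations.append(global_pose[:3, 3].copy())  # first 3 rows of last column
        return np.stack(translations)  # (N + 1, 3) trajectory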

kiriakospapa commented 7 months ago

Hello,

Apologies for not stating my question clearly.

My goal is to include in the training loss the difference between the predicted translation vector and the ground truth translation vector, to see if we can deal with the depth ambiguity with that approach.

I have modified the class SequenceFolder in sequence_folders.py so that it also returns the ground truth pose for each sample.

In the train function in train.py, I have added this code to compute the loss on the translation vector with respect to the ground truth:

        b = tgt_img.shape[0] 
        reordered_output_poses = torch.cat([pose[:, :poses.shape[1]//2],
                                            torch.zeros(b, 1, 6).to(pose),
                                            pose[:, poses.shape[1]//2:]], dim=1)
        # pose_vec2mat only takes B, 6 tensors, so we simulate a batch dimension of B * seq_length
        unravelled_poses = reordered_output_poses.reshape(-1, 6)
        unravelled_matrices = pose_vec2mat(unravelled_poses, rotation_mode=args.rotation_mode)
        inv_transform_matrices = unravelled_matrices.reshape(b, -1, 3, 4)

        # invert the predicted [R|t]: R -> R^T, t -> -R^T @ t
        rot_matrices = inv_transform_matrices[..., :3].transpose(-2, -1)
        tr_vectors = -rot_matrices @ inv_transform_matrices[..., -1:]

        loss_4 = torch.sum(gt_transf_matrix[:, :, :, 3] - tr_vectors[:, :, :, 0])
        loss = w1*loss_1 + w2*loss_2 + w3*loss_3 + w4 * loss_4

Unfortunately, I am not sure whether the predicted translation vectors are expressed with respect to the same frame as the ground truth translation vectors.

kiriakospapa commented 7 months ago

Here is also the modified code from sequence_folders.py:

    def __init__(self, root, seed=None, train=True, sequence_length=3, transform=None, target_transform=None):
        np.random.seed(seed)
        random.seed(seed)
        self.root = Path(root)
        scene_list_path = self.root/'train.txt' if train else self.root/'val.txt'
        self.scenes = [self.root/folder[:-1] for folder in open(scene_list_path)]
        self.transform = transform
        self.crawl_folders(sequence_length)

    def crawl_folders(self, sequence_length):
        sequence_set = []
        demi_length = (sequence_length-1)//2
        shifts = list(range(-demi_length, demi_length + 1))
        shifts.pop(demi_length)
        for scene in self.scenes:
            try:
                poses = np.genfromtxt(scene/'poses.txt').reshape((-1, 3, 4))
                poses_4D = np.zeros((poses.shape[0], 4, 4)).astype(np.float32)
                poses_4D[:, :3] = poses
                poses_4D[:, 3, 3] = 1
            except:
                print("poses.txt was not found in ", scene, "\n skip this sequence")
                self.scenes.remove(scene)
                continue
            intrinsics = np.genfromtxt(scene/'cam.txt').astype(np.float32).reshape((3, 3))
            imgs = sorted(scene.files('*.jpg'))
            assert(len(imgs) == poses.shape[0])
            if len(imgs) < sequence_length:
                continue
            for i in range(demi_length, len(imgs)-demi_length):
                sample = {'intrinsics': intrinsics, 'tgt': imgs[i], 'ref_imgs': [], 'poses': []}
                first_pose = poses_4D[i - demi_length]
                sample['poses'] = (np.linalg.inv(first_pose) @ poses_4D[i - demi_length: i + demi_length + 1])[:, :3]
                for j in shifts:
                    sample['ref_imgs'].append(imgs[i+j])
                sample['poses'] = np.stack(sample['poses'])
                sequence_set.append(sample)
        random.shuffle(sequence_set)
        self.samples = sequence_set

    def __getitem__(self, index):
        sample = self.samples[index]
        tgt_img = load_as_float(sample['tgt'])
        poses = sample['poses']
        ref_imgs = [load_as_float(ref_img) for ref_img in sample['ref_imgs']]
        if self.transform is not None:
            imgs, intrinsics = self.transform([tgt_img] + ref_imgs, np.copy(sample['intrinsics']))
            tgt_img = imgs[0]
            ref_imgs = imgs[1:]
        else:
            intrinsics = np.copy(sample['intrinsics'])
        return tgt_img, ref_imgs, intrinsics, np.linalg.inv(intrinsics), poses

    def __len__(self):
        return len(self.samples)

ClementPinard commented 7 months ago

Ok, thanks for clarifying

Looking at the code, I see that you are computing the ground truth poses with respect to the first frame, but the predicted poses with respect to the target frame.

So I think your problem is there: you might want to multiply your inverse matrices by the inverse of the first one, so that the first matrix is the identity and the others are actual poses.
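
Something along these lines, for example (only a sketch, working on homogeneous 4x4 matrices; to_homogeneous is a hypothetical helper, and whether the inverse goes on the left or on the right depends on the convention of your matrices):

    import torch

    def to_homogeneous(mat):
        # pad (..., 3, 4) transforms with a [0, 0, 0, 1] row to get (..., 4, 4)
        bottom = torch.zeros_like(mat[..., :1, :])
        bottom[..., 0, 3] = 1
        return torch.cat([mat, bottom], dim=-2)

    pred_poses_h = to_homogeneous(inv_transform_matrices)  # (B, seq_length, 4, 4)
    first_inv = torch.inverse(pred_poses_h[:, :1])         # (B, 1, 4, 4)
    poses_wrt_first = first_inv @ pred_poses_h             # the first matrix becomes the identity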

On a more general note, you might want to do the opposite of what you are doing. Instead of computing the ground truth poses relative to the first frame of the sequence as 4x4 matrices, you could compute the equivalent 6D vectors with respect to the target frame (usually the middle one) instead of the first one, so that they already match the order output by the pose network.
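
For the ground truth side, that would essentially be the same operation as the line in your crawl_folders, just anchored on the middle frame instead of the first one. A minimal sketch (illustrative names; if you pass the snippet from crawl_folders, tgt_index would be demi_length):

    import numpy as np

    def poses_wrt_target(poses_4D, tgt_index):
        # Express every absolute 4x4 pose relative to the target (middle) frame,
        # so the ground truth already matches the frames the pose network predicts against.
        tgt_inv = np.linalg.inv(poses_4D[tgt_index])
        return (tgt_inv @ poses_4D)[:, :3]  # back to (N, 3, 4); the target frame is the identity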

I actually did some of this work with my own DepthNet network, where I tested pose supervision: https://github.com/ClementPinard/unsupervised-depthnet/blob/master/train_img_pairs.py#L355

If you want to solve the scale problem on KITTI, you might want to have a look at PackNet-SfM from Toyota, where they supervise a velocity loss (and thus the depth scale as well): https://github.com/TRI-ML/packnet-sfm/blob/master/packnet_sfm/losses/velocity_loss.py
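
The gist of that loss, very roughly (this is not the actual PackNet-SfM code, just a sketch of the idea with made-up names): supervise only the magnitude of the predicted translations, which is enough to pin down the scale while the photometric loss keeps handling the rest.

    import torch

    def translation_magnitude_loss(pred_translations, gt_translations):
        # pred_translations, gt_translations: (B, seq_length, 3)
        # penalise only the difference in displacement length, not the full pose
        pred_norm = pred_translations.norm(dim=-1)
        gt_norm = gt_translations.norm(dim=-1)
        return (pred_norm - gt_norm).abs().mean()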

kiriakospapa commented 7 months ago

Thank you very much Clement for your useful feedback and for pointing me to the paper from Toyota, I was not aware of it!

I assume that when you say the first frame, you are referring to the first frame of the sequence (3 frames by default) and not the first frame of the scene, if I understand the code correctly?

Checking out your code, compensate_pose expresses a transformation matrix with respect to another transformation matrix. Could I therefore use it in train.py as below (this is the code I already posted, modified according to your comments)?

        reordered_output_poses = torch.cat([pose[:, :poses.shape[1]//2],
                                            torch.zeros(b, 1, 6).to(pose),
                                            pose[:, poses.shape[1]//2:]], dim=1)
        # pose_vec2mat only takes B, 6 tensors, so we simulate a batch dimension of B * seq_length
        unravelled_poses = reordered_output_poses.reshape(-1, 6)
        unravelled_matrices = pose_vec2mat(unravelled_poses, rotation_mode=args.rotation_mode)
        inv_transform_matrices = unravelled_matrices.reshape(b, -1, 3, 4)

        rot_matrices = inv_transform_matrices[..., :3].transpose(-2, -1)
        tr_vectors = -rot_matrices @ inv_transform_matrices[..., -1:]

        new_gt_transf_matrix = compensate_pose(inv(gt_transf_matrix), inv(tgt_img)) # Here is the only modification
        loss_4 = torch.sum(new_gt_transf_matrix[:, :, :, 3]  - tr_vectors[:, :, :, 0])
        loss = w1*loss_1 + w2*loss_2 + w3*loss_3 + w4 * loss_4

I am really sorry for the many basic questions; I am very new to the field.

ClementPinard commented 7 months ago

Yes, I think that could work that way.

Now, the realm of transformation matrices is a dark place where you can spend hours trying to figure out in what order you should multiply the matrices and whether you need to invert them or not, so I'd advise you to design some basic tests to make sure that it's working properly.
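
For example, one cheap check along those lines (illustrative names only, assuming the re-referenced poses come as a (seq_length, 3, 4) tensor): after re-referencing, the pose of the reference frame itself should come out as the identity.

    import torch

    def check_reference_is_identity(poses, ref_index, atol=1e-5):
        # poses: (seq_length, 3, 4) transforms after re-referencing
        identity = torch.eye(4, dtype=poses.dtype, device=poses.device)[:3]  # [I | 0]
        assert torch.allclose(poses[ref_index], identity, atol=atol), \
            "reference frame is not the identity -- check multiplication order / inversion"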

What I did in my case was to reduce the dataset to only one sequence. The model will overfit like crazy but it will show whether the pose supervision loss and the photometric loss are consistent. If you can't get both to be low at the same time, it means there's probably a mistake somewhere.

Good luck!

kiriakospapa commented 7 months ago

Hello Clement,

Apologies for reopening the issue after closing it in the first place.

Initially, I tried the approach I mentioned, but I found it way more complicated; I tested it by training on one sequence and did not see the desired results.

So I tried to implement the approach you mentioned, multiplying the inverse matrices by the inverse of the first matrix of the sequence.

Unfortunately, when I trained it on only one sequence, the photometric loss decreased but the ego-motion error did not; it remained roughly the same across all epochs (200 in total).

Here is the code that I implemented inside the train function in the script train.py.

  for i, (tgt_img, ref_imgs, intrinsics, intrinsics_inv, gt_poses) in enumerate(train_loader):
        log_losses = i > 0 and n_iter % args.print_freq == 0
        log_output = args.training_output_freq > 0 and n_iter % args.training_output_freq == 0

        # measure data loading time
        data_time.update(time.time() - end)
        tgt_img = tgt_img.to(device)
        ref_imgs = [img.to(device) for img in ref_imgs]
        intrinsics = intrinsics.to(device)

        # compute output
        disparities = disp_net(tgt_img)
        depth = [1/disp for disp in disparities]
        explainability_mask, pose = pose_exp_net(tgt_img, ref_imgs)

        #========================= Code added for using ego motion as part of the loss ==================
        loss_4 = torch.tensor(0).to(device)
        if args.tr_tv:
            b = tgt_img.shape[0] 
            reordered_output_poses = torch.cat([pose[:, : gt_poses.shape[1]//2],
                                            torch.zeros(b, 1, 6).to(pose),
                                            pose[:,  gt_poses.shape[1]//2:]], dim=1)

            # pose_vec2mat only takes B, 6 tensors, so we simulate a batch dimension of B * seq_length
            unravelled_poses = reordered_output_poses.reshape(-1, 6)
            unravelled_matrices = pose_vec2mat(unravelled_poses, rotation_mode=args.rotation_mode)
            inv_transform_matrices = unravelled_matrices.reshape(b, -1, 3, 4)        

            # 2nd approach
            for j in range(inv_transform_matrices.shape[0]):
                for k in range(inv_transform_matrices.shape[1]):
                    inv_transform_matrices[j, k, :, :] = inv_transform_matrices[j, k, :, :] * inv_transform_matrices[j, 0, :, :]
            # End of 2nd approach

            rot_matrices = inv_transform_matrices[..., :3].transpose(-2, -1)
            # Here are the predicted translation vectors
            tr_vectors = -rot_matrices @ inv_transform_matrices[..., -1:]
            loss_4 = torch.sum(torch.abs( gt_poses[:, :, :, -1].to(device) - tr_vectors[:, :, :, 0].to(device)))
            loss_4 = loss_4.to(device)
            loss_4 = torch.tensor(loss_4, dtype=torch.float64)
        #========================= Code added for using ego motion as part of the loss ==================

        loss_1, warped, diff = photometric_reconstruction_loss(tgt_img, ref_imgs, intrinsics,
                                                               depth, explainability_mask, pose,
                                                               args.rotation_mode, args.padding_mode)
        if w2 > 0:
            loss_2 = explainability_loss(explainability_mask)
        else:
            loss_2 = 0
        loss_3 = smooth_loss(depth)

        loss = w1*loss_1 + w2*loss_2 + w3*loss_3 + 0.6 * loss_4

I am new to the field, so I cannot be sure about my implementation. I would really appreciate it if you could help me figure out what the problem is.