DecaYale / RSLO

Robust Self-supervised LiDAR Odometry via Representative Structure Discovery and 3D Inherent Error Modeling, IEEE Robotics and Automation Letters (RA-L) (presented at ICRA 2022)

Cannot reproduce the results on KITTI odometry dataset #4

Open spirit-man opened 3 months ago

spirit-man commented 3 months ago

First of all, I would like to express my deepest appreciation for making the code available and for your innovative work documented in the paper.

What error I have encountered: I am currently trying to use your code to reproduce your results on the KITTI odometry dataset and have run into some difficulties. I trained for 15 epochs (26,850 steps), and then the loss became NaN or Inf. Furthermore, the estimated trajectories are quite different from the ground-truth trajectories, and the losses decreased for the first few epochs but then started increasing again. Here are my eval results after 15 epochs:

[eight evaluation screenshots omitted]

What I've modified for custom settings:

  1. I used CUDA 11.2 and the corresponding packages in Docker, since the Docker image referenced in the Dockerfile is no longer available on Docker Hub. The main changes are to the code associated with spconv (in models/middle.py), since your version is a custom one and conflicts with the packages built for cu112:

     def forward(self, voxel_features, coors, batch_size):
         # assert batch_size==1, "Only support batch_size=1 for now"
         # coors[:, 1] += 1
         coors = coors.int()
         # build the sparse tensor from voxel features and integer coordinates
         ret = spconv.SparseConvTensor(voxel_features, coors, self.sparse_shape,
                                       batch_size)

         ret0 = self.middle_conv(ret)
         ret = self.middle_conv_tail(ret0)

         cov_pred = self.middle_cov_deconv(ret0)

         # newer spconv tensors are immutable, so replace_feature() is used
         # instead of assigning to .features in place; the old line was:
         # cov_pred.features[:,:3] = F.elu(cov_pred.features[:,:3]) + 1 + 1e-6
         # ELU(x)+1+1e-6 keeps the first three (eigenvalue) channels strictly positive
         elu_features = F.elu(cov_pred.features[:, :3]) + 1 + 1e-6
         cov_pred = cov_pred.replace_feature(torch.cat((elu_features, cov_pred.features[:, 3:]), dim=1))

         # densify and fold the depth dimension into the channels
         ret = ret.dense()
         N, C, D, H, W = ret.shape
         ret = ret.view(N, C * D, H, W)

         return ret, cov_pred.features
  2. I encountered another error in core/losses.py, line 422:

     sigma = cov_pred[b] + R_pred[b].detach() @ cov_pred_assoc @ R_pred[b].detach().transpose(-1,-2)

     Here det(sigma) may become negative, causing an error in line 435:

     loss = torch.mean(square_diff) + self.reg_weight * torch.mean(0.5 * torch.log(torch.det(sigma)))

     Based on my analysis, cov_pred[b] and cov_pred_assoc both come from the function span_cov2, which returns eigvec @ eigval @ eigvec.transpose(-1,-2), where eigval is a diagonal matrix and eigvec is a rotation matrix (orthogonal). Thus cov_pred[b] and cov_pred_assoc should be symmetric: since eigval is diagonal, we can write eigval = E @ E.transpose(-1,-2), giving eigvec @ E @ E.transpose(-1,-2) @ eigvec.transpose(-1,-2), which equals its own transpose. Similarly, R_pred[b].detach() @ cov_pred_assoc @ R_pred[b].detach().transpose(-1,-2) should be symmetric, since cov_pred_assoc is symmetric. The sum of cov_pred[b] and this term, both symmetric (and both positive semi-definite, given that the eigenvalues are made positive by the ELU+1 in the network), should therefore make sigma a symmetric PSD matrix, so det(sigma) should not be negative.

     In practice, though, I found that cov_pred[b] and cov_pred_assoc are only nearly symmetric (the norm of the difference between each matrix and its transpose is around 1e-10), R_pred[b].detach() is only nearly orthogonal, and R_pred[b].detach() @ cov_pred_assoc @ R_pred[b].detach().transpose(-1,-2) is noticeably asymmetric (norm around 1e-3). Moreover, an algebraically equivalent rewriting,

     sigma = R_pred[b].detach() @ (R_pred[b].detach().transpose(-1,-2) @ cov_pred[b] + cov_pred_assoc @ R_pred[b].detach().transpose(-1,-2))

     which should give the same result since R is orthogonal, produced a quite different determinant (the former returns around -1e-6, the latter around 1e-5). When the smallest eigenvalue of sigma is tiny, det(sigma) is near zero, so an asymmetric float32 perturbation of order 1e-3 is enough to push it negative.

     My workaround is to copy all the tensors used to compute sigma into torch.float64 variables and compute sigma in double precision; det(sigma) then stays positive (see the sketch after the span_cov2 listing below). This looks like a numerical-stability problem. For reference, here is span_cov2:

     def span_cov2(cov_param_pred, return_eig_vec=False):
         # cov_param_pred: Nx7 (3 eigenvalue parameters + 4 quaternion components)
         cov_param = cov_param_pred.clone()
         # cumulative sums over the first three parameters (yields ordered
         # eigenvalues when the raw increments are positive)
         cov_param[:, 1:2] = cov_param[:, 0:1] + cov_param_pred[:, 1:2]  #!!
         cov_param[:, 2:3] = cov_param[:, 1:2] + cov_param_pred[:, 2:3]  #!!
         # normalize the quaternion part
         cov_param[:, 3:] = cov_param[:, 3:].clone() / (torch.norm(cov_param_pred[:, 3:], dim=-1, keepdim=True) + 1e-9)

         # assemble the diagonal eigenvalue matrices (Nx3x3)
         eigval = torch.zeros(cov_param.shape[0], 9, device=cov_param.device, dtype=cov_param.dtype)
         eigval[:, ::4] = cov_param[:, :3]
         eigval = eigval.reshape(-1, 3, 3)
         eigvec = kornia.quaternion_to_rotation_matrix(cov_param[:, 3:])  # Nx3x3
         if not return_eig_vec:
             return eigvec @ eigval @ eigvec.transpose(-1, -2)
         else:
             return eigvec @ eigval @ eigvec.transpose(-1, -2), eigvec
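
And here is a minimal sketch of the float64 workaround I applied (the helper name, the symmetrization step, and the eps jitter are my own additions, not code from the repository):

     import torch

     def stable_sigma(cov_pred_b, cov_pred_assoc, R_pred_b, eps=1e-8):
         # cast everything to double so that the near-symmetry of the float32
         # inputs survives the matrix products
         cov0 = cov_pred_b.double()
         cov1 = cov_pred_assoc.double()
         R = R_pred_b.detach().double()

         sigma = cov0 + R @ cov1 @ R.transpose(-1, -2)
         # enforce exact symmetry and add a small diagonal jitter, so that
         # rounding alone cannot drive det(sigma) negative
         sigma = 0.5 * (sigma + sigma.transpose(-1, -2))
         sigma = sigma + eps * torch.eye(3, dtype=sigma.dtype, device=sigma.device)

         # slogdet is more stable than torch.log(torch.det(...))
         sign, logdet = torch.linalg.slogdet(sigma)
         return sigma, logdet

The regularization term of the loss can then use logdet directly instead of torch.log(torch.det(sigma)).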

Questions I would like support on:

  1. How can I fix the NaN losses, and why am I getting such bad results after 15 epochs even though the loss itself is quite low? How many epochs are needed to reproduce your results?
  2. Why does det(sigma) become negative? Can I use torch.float64 to avoid it?
  3. I've noticed that your losses are conventional losses plus a parameter alpha, which can make the loss negative. Why do we need alpha here? A negative loss does not seem very intuitive; what is its actual meaning?
DecaYale commented 2 months ago

Thank you for your interest in our work. I am sorry to hear about the issue you encountered. We tested the training at the time we released the code, and it worked well. Since the environment you are using differs a little from the original one and you have made some changes to the code, you may need to take some additional care with it. I can give you some advice based on your feedback:

  1. Decrease the learning rate a little bit.
  2. Adjust the batch size.
  3. Check the sanity of spconv after your modifications (a quick check is sketched below).
  4. Tune other parameters if necessary.

Hopefully, you'll solve this problem soon.
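
For point 3, a quick sanity check might look like the sketch below. This is written against the public spconv 2.x API; the channel sizes and spatial shape are just illustrative:

     import torch
     import spconv.pytorch as spconv

     def spconv_sanity_check(batch_size=2, num_voxels=100, spatial_shape=(16, 32, 32)):
         # random voxel features (NxC) and integer coordinates (Nx4: batch, z, y, x)
         features = torch.randn(num_voxels, 4).cuda()
         indices = torch.stack([
             torch.randint(0, batch_size, (num_voxels,)),
             torch.randint(0, spatial_shape[0], (num_voxels,)),
             torch.randint(0, spatial_shape[1], (num_voxels,)),
             torch.randint(0, spatial_shape[2], (num_voxels,)),
         ], dim=1).int().cuda()

         x = spconv.SparseConvTensor(features, indices, spatial_shape, batch_size)
         conv = spconv.SubMConv3d(4, 8, kernel_size=3, indice_key="subm0").cuda()
         out = conv(x)

         # submanifold conv preserves the spatial shape
         dense = out.dense()  # (N, C, D, H, W)
         assert dense.shape == (batch_size, 8, *spatial_shape)
         assert torch.isfinite(out.features).all(), "non-finite features after spconv"
         print("spconv sanity check passed:", dense.shape)

     spconv_sanity_check()

If this passes but your modified forward still produces NaNs, the problem is more likely in the loss computation than in the spconv migration itself.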