HTLife / VINet

VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem

Training did not converge #1

Open HTLife opened 6 years ago

HTLife commented 6 years ago

Recently, I tried to implement VINet [1] and open-sourced it on GitHub: HTLife/VINet

I have already completed the whole network structure, but the network does not converge properly during training. How could I fix this problem?

Possible problems & solutions:

  1. The dataset is too challenging: I'm using the EuRoC MAV dataset, which is more challenging than the KITTI VO dataset used by DeepVO and VINet (the KITTI vehicle images barely shake up and down), so the network may not be able to learn the camera motion correctly.

  2. Loss function: an L1 loss is used, intended to be identical to the design in [1]. (I'm not fully confident that I understand the loss design in [1].) Related code

  3. Other hyperparameter problems



lrxjason commented 6 years ago

How long does it take to train on the EuRoC dataset? I tried VINet first and found it difficult to converge, then I tried DeepVO and it does not converge either. The loss function I use is L = pose error + 100 * angle error.
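
Concretely, the loss I mean is roughly this (a minimal PyTorch sketch; the function name and tensor shapes are placeholders, not my actual training code):

```python
import torch
import torch.nn.functional as F

def weighted_pose_loss(pred_t, gt_t, pred_r, gt_r, beta=100.0):
    """L1 translation error plus a weighted L1 rotation error.

    pred_t, gt_t: (B, 3) translations; pred_r, gt_r: (B, 3) or (B, 4)
    rotation parameters (e.g. Euler angles or a quaternion).
    beta is the weight on the angular term (100 in my experiments).
    """
    trans_loss = F.l1_loss(pred_t, gt_t)
    rot_loss = F.l1_loss(pred_r, gt_r)
    return trans_loss + beta * rot_loss
```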

HTLife commented 6 years ago

@lrxjason The current version 60d21c7 won't converge. Which DeepVO GitHub repo did you use?

This implementation of VINet is still under tuning. The current version 60d21c7 still has some parts that don't follow the original paper.
Now I'm trying to convert the (x y z quaternion) format to se(3) and change the loss function, to see if this helps.
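
Roughly, the conversion I have in mind looks like this (a minimal NumPy/SciPy sketch of the se(3) logarithm map; the function names and the (rho, omega) ordering are my own choices, not the repo's code):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def skew(v):
    """3x3 skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def xyzquat_to_se3(x, y, z, qw, qx, qy, qz):
    """Map a translation + unit quaternion to a 6-vector in se(3).

    Returns (rho, omega): omega is the rotation vector log(R), and
    rho = V^{-1} t, with V the left Jacobian of SO(3).
    """
    t = np.array([x, y, z])
    omega = R.from_quat([qx, qy, qz, qw]).as_rotvec()   # SciPy expects (x, y, z, w)
    theta = np.linalg.norm(omega)
    W = skew(omega)
    if theta < 1e-8:                                     # small-angle approximation
        V_inv = np.eye(3) - 0.5 * W
    else:
        V_inv = (np.eye(3) - 0.5 * W
                 + (1.0 / theta**2
                    - (1.0 + np.cos(theta)) / (2.0 * theta * np.sin(theta))) * W @ W)
    rho = V_inv @ t
    return np.concatenate([rho, omega])                  # 6-vector (rho, omega)
```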

(attached figure: network detail)

lrxjason commented 6 years ago

@HTLife I use this: https://github.com/sladebot/deepvo/tree/master/src , but I changed the training batch part.

I tried both FlowNetS and a plain CNN; neither converges.

For the angle, I tried pitch/roll/yaw and qw qx qy qz, but I get the same non-converging result. And for training on 4000 images, one epoch takes almost an hour.

HTLife commented 6 years ago

As the VINet paper mentions, they found it hard to converge when training only on the frame-to-frame pose change, so they also take the accumulated global pose (the pose relative to the starting point) into account to help the network converge. I'll try this idea tomorrow and verify whether it works.

@lrxjason Do you have any other suggestions to help this network converge?

lrxjason commented 6 years ago

I tried using just the CNN part (ignoring the LSTM part); it also doesn't converge. I will try to train PoseNet on the dataset tomorrow.

I'm confused about the global pose. If the current camera position is far from the starting point, so that there is no overlapping area, how could the global pose help?

On my hardware I can only set the timestep to 10, which means I can only estimate the pose of 10 frames; otherwise the GPU runs out of memory. @HTLife

HTLife commented 6 years ago

About global pose

I think they are not using the global pose directly. Instead, they use the "difference" between the ground-truth global pose and the accumulated pose. This loss design might reduce the "drift" of the estimate.

On page 3999 of the paper. (attached figure: excerpt)
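
My understanding of that loss, as a rough PyTorch sketch (the tensor layout, weights, and names are my own assumptions, not the paper's code): the predicted relative poses are composed into a running pose and compared against the ground truth expressed relative to the start of the window.

```python
import torch

def composed_pose_loss(pred_rel, gt_rel, gt_global, alpha=1.0, beta=1.0):
    """pred_rel, gt_rel: (N, 4, 4) relative SE(3) transforms between
    consecutive frames; gt_global: (N, 4, 4) ground-truth transforms
    relative to the first frame of the window.

    Combines a frame-to-frame L1 term with an L1 term on the pose
    accumulated from the start of the window, which should penalise drift.
    """
    rel_loss = torch.mean(torch.abs(pred_rel - gt_rel))

    acc = torch.eye(4, dtype=pred_rel.dtype, device=pred_rel.device)
    acc_loss = 0.0
    for i in range(pred_rel.shape[0]):
        acc = acc @ pred_rel[i]                      # compose predicted relative poses
        acc_loss = acc_loss + torch.mean(torch.abs(acc - gt_global[i]))
    acc_loss = acc_loss / pred_rel.shape[0]

    return alpha * rel_loss + beta * acc_loss
```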

Code update

I just uploaded a version with a complete SE3 implementation (SHA: 4f14be7bd5a5163dc0b9a41e1ffa9473f5817758).

Implementation detail:

Because of the xyz-quaternion feedback design, training can currently only use SGD with batch size 1.

I'm now training with this new implementation and adjusting the loss details.

By the way, @lrxjason, did you make any progress with PoseNet?

HTLife commented 6 years ago

"SE3 composition layer" been mentioned in VINet might be related to gvnn. Since the PyTorch implementation is not available (gvnn is implement in torch), I replace the "SE3 composition layer" by SE3 Matrix multiplication.

VINet did not mention the detail of SE3 composition layer, but the related description could be found in [1] (publish by same Lab).

I do not understand the difference between training "directly on se3" and "SE3", and how would that affect the convergence of training.
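
What I did instead, roughly (a minimal NumPy/SciPy sketch of the exp map from a predicted se(3) 6-vector plus plain matrix multiplication; function names are illustrative, not the repo's code):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def se3_exp(xi):
    """Exponential map from a 6-vector (rho, omega) in se(3) to a 4x4 SE(3) matrix."""
    rho, omega = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    W = np.array([[0, -omega[2], omega[1]],
                  [omega[2], 0, -omega[0]],
                  [-omega[1], omega[0], 0]])
    if theta < 1e-8:                                  # small-angle approximation
        V = np.eye(3) + 0.5 * W
    else:
        V = (np.eye(3)
             + ((1 - np.cos(theta)) / theta**2) * W
             + ((theta - np.sin(theta)) / theta**3) * W @ W)
    T = np.eye(4)
    T[:3, :3] = R.from_rotvec(omega).as_matrix()
    T[:3, 3] = V @ rho
    return T

def accumulate(relative_xis):
    """Compose a sequence of predicted relative se(3) vectors into a global pose."""
    T = np.eye(4)
    for xi in relative_xis:
        T = T @ se3_exp(xi)   # plain SE(3) matrix multiplication instead of the gvnn layer
    return T
```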

[1] S. Wang, R. Clark, H. Wen, and N. Trigoni, "End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks," Int. J. Rob. Res., 2017.

copark86 commented 6 years ago

@HTLife I have looked at the paper and your code again. It seems like the short batch-trajectory structure is important, as described in Figure 4 of the original paper. SE3 accumulates error over time, which will hinder convergence. It seems like they divided the trajectory into short fragments with different initial poses, and to keep continuity they pass the RNN hidden state of the previous batch to the next batch. Maybe that could be the cause of the problem. What do you think?

By the way, how are you debugging the code inside Docker? I only started using Python and Docker a few days ago. Any advice would be appreciated.

HTLife commented 6 years ago

@copark86 I hadn't noticed the part about passing the RNN hidden state to the next batch! That might be important. I'll also find some time to replace the "matrix multiplication implementation of accumulated SE3" with the "SE3 composition layer" and see if that works.
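
If I understand the idea, carrying the state would look roughly like this (a minimal PyTorch sketch; the sizes and the fragment list are placeholders, not my training loop):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1024, hidden_size=1024, num_layers=2, batch_first=True)

# stand-in for consecutive trajectory fragments: (batch=1, seq_len=10, feature=1024)
fragments = [torch.randn(1, 10, 1024) for _ in range(3)]

hidden = None  # PyTorch initialises (h0, c0) to zeros for the first fragment
for feats in fragments:
    out, hidden = lstm(feats, hidden)
    # carry the state to the next fragment, but detach it so gradients
    # do not propagate across fragment boundaries
    hidden = tuple(h.detach() for h in hidden)
```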

copark86 commented 6 years ago

@HTLife I just found that read_R6TrajFile reads the relative pose as a 1x6 vector, whereas the saved relative-pose file is actually written in quaternion format. Is sampleGT_relative.py the right code to generate the relative pose?

copark86 commented 6 years ago

@HTLife Regarding the input of the IMU RNN, I guess it should include every IMU sample between two image frames, but it seems like your code loads only 5 nearby IMU samples. Please correct me if I am wrong.

HTLife commented 6 years ago

About the magic number of reading IMU data

@copark86 Sorry about that magic number. The EuRoC MAV dataset synchronizes the camera and IMU timestamps. There is one IMU record between two images.

(attached figure — left: image timestamps, right: IMU data.csv)

What I did was feed the RNN a longer IMU sequence (arbitrary length 5) rather than 3 (image 1 time, middle, image 2 time).

HTLife commented 6 years ago

About read_R6TrajFile dimension

@copark86
The corresponding part is shown in the attached figure.

  1. I calculate the relative pose (x y z ww wx wy wz) from the absolute pose (x y z ww wx wy wz); see the sketch after this comment.
  2. se(3) can be represented with 6 values. Therefore, (x y z ww wx wy wz) should be converted to R^6, and the output dimension of VINet should also be 6 rather than 7.

I found that I was reading the wrong file. It should be:

self.trajectory_relative = self.read_R6TrajFile('/vicon0/sampled_relative_R6.csv')
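
The relative-pose computation of step 1 above, as a rough sketch (NumPy/SciPy, assuming the (x y z ww wx wy wz) ordering of the CSV; the helper names are mine, not the actual script). The resulting 4x4 matrix can then be converted to an R^6 se(3) vector as in step 2.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def to_matrix(x, y, z, qw, qx, qy, qz):
    """Absolute pose (x y z ww wx wy wz) as a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3, :3] = R.from_quat([qx, qy, qz, qw]).as_matrix()  # SciPy expects (x, y, z, w)
    T[:3, 3] = [x, y, z]
    return T

def relative_pose(abs_prev, abs_curr):
    """Relative transform expressed in the previous frame: T_rel = T_prev^{-1} T_curr."""
    T_prev = to_matrix(*abs_prev)
    T_curr = to_matrix(*abs_curr)
    return np.linalg.inv(T_prev) @ T_curr
```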
copark86 commented 6 years ago

@HTLife Regarding the IMU data, the camera is 20 Hz and the IMU is 200 Hz. The training dataset has 2280 images and 22801 IMU measurements, which means there should be 10 IMU measurements between two consecutive images. If there are only two IMU samples between two images as you said, does that mean you are pre-integrating the IMU data?

Reading sampled_relative_R6 instead of sampled_relative makes perfect sense. Thanks for the explanation.

HTLife commented 6 years ago

@copark86 Let me describe it again.

IMU => no sampling

We should use the raw IMU data (x, y, z, wx, wy, wz). The IMU sequence could span 8 images (an arbitrary number); there are about 70 IMU records across 8 images, so the magic number should change from 5 to 70 (a sketch follows below).

(attached figure; I assume one red box corresponds to one image)

VICON => sampled to camera rate

The ground truth is VICON at 40 Hz (twice the rate of the 20 FPS stereo camera).
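
In code, the IMU windowing I mean is roughly this (a minimal NumPy sketch; the array layout and function name are assumptions, not the dataloader's actual code):

```python
import numpy as np

def imu_window(imu_stamps, imu_values, t_start, t_end):
    """Return all raw IMU rows whose timestamps fall between two image times.

    imu_stamps: (N,) timestamps from IMU data.csv; imu_values: (N, 6) raw
    IMU rows (angular rate + acceleration). t_start / t_end are the
    timestamps of the first and last image of the fragment, e.g. spanning
    8 frames (~70 rows for a 200 Hz IMU against a 20 Hz camera).
    """
    mask = (imu_stamps >= t_start) & (imu_stamps <= t_end)
    return imu_values[mask]
```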

HTLife commented 6 years ago

@copark86 The VICON R6 conversion code has been updated: e8d72ea

See scripts/convertSampledTrajToR6.py

The original R6 file was incorrect: it converted sampled_relative from (x y z ww wx wy wz) into (x y z so3). scripts/convertSampledTrajToR6.py now converts (x y z ww wx wy wz) to an se(3) R^6 vector.

copark86 commented 6 years ago

IMU

@HTLife Thanks for the explanation.

VICON

I just found that the VICON timestamps and the IMU timestamps are not actually recorded by the same system (the IMU and camera are in the same system). How did you find the time offset between the VICON and the IMU?

copark86 commented 6 years ago

@HTLife It seems like they used gravity-removed acceleration. If gravity is not removed, eq. 10 and 13 are totally wrong. This is important but not mentioned anywhere in the paper. What do you think?
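
What I mean is something like this (a rough sketch only; the gravity sign and the body/world frame conventions are assumptions that must be checked against the dataset, e.g. on a stationary segment):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

G_WORLD = np.array([0.0, 0.0, 9.81])   # gravity along world z; sign is convention-dependent

def gravity_compensate(acc_body, q_wb):
    """Remove the gravity component from a raw accelerometer reading.

    acc_body: (3,) specific force measured in the body frame.
    q_wb: orientation quaternion (qx, qy, qz, qw) rotating body -> world,
    e.g. taken from the ground truth.
    """
    R_wb = R.from_quat(q_wb).as_matrix()
    g_body = R_wb.T @ G_WORLD            # gravity expressed in the body frame
    return acc_body - g_body
```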

HTLife commented 6 years ago

@copark86 The IMU values are not directly connected to the se(3) output, so the LSTM still has a chance to output the right value from the input.

However, having values close to 0 might help the convergence speed.


Also, I found that the official project page of VINet was updated. We could follow the updates of slambench2; VINet might become available in that project in the future.

HTLife commented 6 years ago

I found that I misunderstood how to use the SE3 composition layer. (attached figure) The orange line shows how it should look.

copark86 commented 6 years ago

I think that is correct, although the original figure in the paper shows the current pose being concatenated with the IMU and FlowNet features. There is no reason to concatenate the current pose; if that is not their intention, the figure misleads readers.

Does it converge better now?

Months ago, I wrote new code to regenerate the dataset, as I found a possible mistake in yours (I will check it again and let you know; it was too long ago).

HTLife commented 6 years ago

@copark86 I'm porting the SE3 composition layer from gvnn (lua+torch) to PyTorch.
I'll start training again after the porting is complete.

Here is the draft of the gvnn SE3 port => link

HTLife commented 6 years ago

The complete SE3 composition layer implementation is out! (link) I'll start merging this implementation into VINet.

JesperChristensen89 commented 5 years ago

@HTLife @copark86 Did you manage to get the network to converge with good results?

HTLife commented 5 years ago

@JesperChristensen89 I haven't had time to focus on this project recently, but if you are interested in continuing this work, I'd be willing to join the discussion.

HTLife commented 5 years ago

@Adamquhao Cool, are you willing to share it with me (by sending a pull request)?

fangxu622 commented 4 years ago


@HTLife Should we resample the IMU so that it is aligned with the images?

xuqiwe commented 4 years ago


Hello! I would like to try VINet, but I'm stuck because I don't have the dataset. By the way, I have never used Docker. How can I get the EuRoC MAV dataset, or another dataset like the KITTI VO dataset, suited to this code? Thank you! My email is xuqw@stu.xidian.edu.cn.

HTLife commented 4 years ago

@xuqiwe I didn't finish this work in the end, but @Adamquhao seems to have successfully built and trained the network.

The following is his advice:

"First, I recommend reading 'Selective Sensor Fusion for Neural Visual-Inertial Odometry', whose authors are from the same department as VINet's. That paper reveals some details about VINet-like networks: the VO features and the IMU features need to be the same size. You can resize the features after the VO encoder using an FC layer and concatenate them together, then feed the concatenated feature directly to the last LSTM (with a suitable sequence length). Finally, you get the 6-DoF pose between image pairs from FC layers. The idea is very simple. There are some tricks during pretraining: first pretrain the DeepVO decoder (without the LSTM) on the KITTI odometry dataset and use the fixed decoder backbone in later experiments (idea from https://github.com/linjian93/pytorch-deepvo). I did not compare the results with those in VINet, but the loss did converge and I got reasonable results (better than DeepVO alone)."

xuqiwe commented 4 years ago


Thanks a lot!

xuqiwe commented 4 years ago

I want to know whether @Adamquhao has open-sourced his code, because I ran into some problems when re-implementing the hard fusion from "Selective Sensor Fusion for Neural Visual-Inertial Odometry". Thanks in advance!

Zacon7 commented 11 months ago


@HTLife Well, I think there are 10 rather than 2 IMU records between two images. (attached figure)

hu-xue commented 7 months ago


@Zacon7 I think the 2 records come from the VICON device (ground truth), whereas the 10 records you mention come from the IMU device (not ground truth); see the author's description earlier in this thread.