@alessandroscoppio I will try to answer the questions you sent me via LinkedIn.
Q1) How does your loss behave? When I train only with IMUs (without even the odom net, just a simple LSTM regressing position and orientation directly), the two parameters sx and sq of the HWS loss "converge" to negative numbers (around -8 and -12), and the whole loss is negative basically the whole time (dropping to about -20)?
First of all, more about the HWS (Homoscedastic Weighted Sum) and LWS (Linear Weighted Sum) losses can be found here and here. In the case of HWS it is correct that the values become negative; you should actually be worried if they don't ;). Since the loss function is not really bounded from below, it will keep being minimized until some balance between the orientation and translation losses is reached. Please note that the HWS loss itself has two learnable parameters, namely the two variances (s_x, s_q). Somewhere around -20..-30 is also what I get. And indeed, in my experience "sq" is most of the time about 3/2 times smaller than "sx", since regressing rotation is much harder than regressing translation, especially when you also add the global loss (see Q2) to it.
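For reference, a minimal PyTorch sketch of an HWS-style loss (following Kendall et al.), assuming the translation and orientation losses are already computed; the class name and init values here are illustrative, not the exact DeepLIO code:

```python
import torch
import torch.nn as nn

class HWSLoss(nn.Module):
    """Sketch of a homoscedastic weighted sum of two loss terms.
    sx, sq are the learnable log-variances mentioned above; since they are
    unbounded, the total can (and typically does) become negative."""
    def __init__(self, sx_init=0.0, sq_init=-2.5):
        super().__init__()
        self.sx = nn.Parameter(torch.tensor(sx_init))
        self.sq = nn.Parameter(torch.tensor(sq_init))

    def forward(self, loss_x, loss_q):
        # exp(-s) weights each term; the additive s penalizes large variances.
        return (loss_x * torch.exp(-self.sx) + self.sx
                + loss_q * torch.exp(-self.sq) + self.sq)
```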
Q2) Did using more frames and combinations give that much improvement? So far I am using only 2 steps for the pose regression, and since I have simulated data I can just generate more training data instead of using more combinations. But I saw you answering on GitHub that you do it to get more global meaning or something like that; could you share your thoughts/experience?
Well, using more frames makes sense in one of these two cases:
You use an LSTM or GRU for the IMU and/or the odometry net, so the recurrent network can learn the spatio-temporal correlation between the frames and the IMU measurements.
You additionally use the global loss. You can think of it as follows:
Assuming we have this configuration:
combinations: [[0, 1], [1, 2], [2, 3], [3, 4]]
This means that we want a sequence of five consecutive frames, i.e. four frame-to-frame transitions. The local loss in this case is just the relative motion between each pair of consecutive frames (f2f), e.g. T{0,1}, T{1,2} and so on. The purely local loss is commented out in the latest version, see here. One problem with this loss is that the network basically learns a kind of frame-to-frame ICP, which is also affected by continuous drift over time.
In contrast, the global loss also incorporates the relative motion between the n-th frame and the first frame (frame 0), i.e. T{0,1}, T{0,2}, and so on. This is denoted as (f2g) in the code. But since the network only predicts the relative motion between two consecutive frames, we need to integrate these predictions over the entire sequence to build up the global loss. This happens in the se3-to-SE3 function. Both losses are summed up here.
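As an illustration, here is a minimal sketch (a hypothetical helper, not the actual se3-to-SE3 function in the repo) of how the frame-to-frame predictions T{i,i+1} can be chained into poses relative to frame 0 for the f2g term:

```python
import torch

def chain_relative_poses(rel_rots, rel_trans):
    """rel_rots[i] (3x3) and rel_trans[i] (3,) describe T_{i,i+1};
    returns the accumulated global poses T_{0,i+1} for the whole sequence."""
    R_0i = torch.eye(3)
    t_0i = torch.zeros(3)
    global_rots, global_trans = [], []
    for R, t in zip(rel_rots, rel_trans):
        # T_{0,i+1} = T_{0,i} * T_{i,i+1}
        t_0i = t_0i + R_0i @ t
        R_0i = R_0i @ R
        global_rots.append(R_0i)
        global_trans.append(t_0i)
    return global_rots, global_trans
```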
But also in my experience the global loss did not converge well, especially when using the LiDAR feature net. For this reason I implemented a sort of Truncated Backpropagation Through Time (TBTT) by restricting it to 3 frames at the moment, see here. In summary, you can safely increase the combinations to give the LSTM a longer sequence, but independently of that only the first 3 frames are used for the global loss, at least at the moment. Maybe it would be better to make this number configurable in the config file; you are heartily welcome to make a pull request. :)
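A hypothetical sketch of that truncation, assuming the chained global translations from the snippet above are already available (a plain MSE on translation stands in here for the real metric):

```python
import torch

N_GLOBAL = 3  # currently hard-coded in the repo; moving it to the config is the suggested PR

def combined_loss(loss_f2f, global_trans_pred, global_trans_gt, n_global=N_GLOBAL):
    """loss_f2f: scalar frame-to-frame loss; global_trans_*: lists of 3-vector
    translations T_{0,i}. Only the first n_global poses enter the f2g term,
    and the two terms are simply summed, as described above."""
    loss_f2g = torch.tensor(0.0)
    for t_p, t_g in zip(global_trans_pred[:n_global], global_trans_gt[:n_global]):
        loss_f2g = loss_f2g + torch.mean((t_p - t_g) ** 2)  # placeholder metric
    return loss_f2f + loss_f2g
```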
Regarding simulated data: well, KITTI data is kind of biased, since the rotations are mostly in a plane around the z-axis (upward), so if you have a dataset with more variation in rotation, it would be great if you could share it.
I hope I could answer your questions and make it a bit clearer how things work in DeepLIO. I am really open to and happy about any suggestions and extensions :)
Best regards Arash Javan
Hi @ArashJavan, thanks for your reply.
Somewhere around -20..-30 is also what I get.
Did you mean the total loss? Because that is also what I get, but it feels "wrong" to me. Neural network training is formulated as a minimization problem, and that's why the standard loss functions are always positive (MSE, cross-entropy, etc.). Therefore I am worried to see a loss that becomes negative: if we were to use just a simple MSE and it somehow became negative, that would mean we are updating the weights such that the difference to the ground truth grows even larger.
I am thinking of squaring the whole HWS loss and seeing how it goes, but I will have a second look at the original paper first; maybe I am missing something.
Dear @ArashJavan, while renewing my question about what your final loss looks like, I would like to ask another question (in the same issue so as not to pollute your GitHub, but it can be moved if you want to keep things separated): when regressing the orientation, I see that the orientation fully connected layer predicts 3 values and not 4, even though the orientation is expressed as a quaternion. Does this mean that you are only regressing the wx, wy, wz components and calculating the real part (wq)? So far I am regressing the full quaternion, and the results are actually pretty OK (while the position is still very off), but I can see that this could lead to quaternions that are not unit quaternions. What's your take on this?
@alessandroscoppio, well, the relative orientation between two consecutive frames is predicted not as a quaternion, but in log-space, i.e. the Lie algebra so(3) of SO(3) (the rotational part of se(3)), which is why only 3 values are regressed. This is also why the f2f ground truth is first converted to the log-space, see here. What I have experienced is that optimizing directly on a normalized quaternion is harder than in the log-space, since there are no constraints (normalization) to take care of.
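To make the 3-vs-4-values point concrete, here is a small, hypothetical pair of helpers (not from the repo) mapping between a unit quaternion and its unconstrained 3-vector log-space representation:

```python
import torch

def quat_to_log(q, eps=1e-8):
    """q = (w, x, y, z), assumed unit-norm -> 3-vector log (axis-angle) of the rotation."""
    w, v = q[0], q[1:]
    v_norm = torch.linalg.norm(v)
    if v_norm < eps:
        return torch.zeros(3)
    angle = 2.0 * torch.atan2(v_norm, w)
    return angle * v / v_norm

def log_to_quat(phi, eps=1e-8):
    """3-vector log -> unit quaternion (w, x, y, z); no normalization constraint
    has to be enforced during training."""
    angle = torch.linalg.norm(phi)
    if angle < eps:
        return torch.tensor([1.0, 0.0, 0.0, 0.0])
    axis = phi / angle
    half = angle / 2.0
    return torch.cat([torch.cos(half).reshape(1), torch.sin(half) * axis])
```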
Please take a look at the paper VINet. The loss in DeepLIO is more or less the same as theirs. The conversion from se(3) to SE(3), used to predict the relative motion of each frame in a sequence w.r.t. the first frame of the sequence, i.e. {T01, T02, T03, ..., T0S}, happens here.
Furthermore, combining the two losses by adding them (frame-to-frame se(3) + global SE(3)) is inspired by VLOCNet.
Hi there!
First off, thanks for sharing such a beautiful project! For my thesis I am working on basically the same task, but using more sensors (including RGB images) and different processing (especially for the LiDAR, which I am processing as a sparse 3D tensor instead of projecting it to 2D). I liked the way you structured the training logic a lot!
What I would like to ask you, though, is the reason behind having both a global and a local reference when calculating the loss (here).
In my project I am planning to use only the local one (i.e. at each step regress the motion from the start to the end of the step). I tried using only the global one, but the results are pretty bad.
Hope to have some nice insight! Cheers