DanielCoelho112 / synfeal

Synfeal: A Data-Driven Simulator for Camera Localization

Explore loss functions for 6DoF #21

Closed: DanielCoelho112 closed this issue 2 years ago

DanielCoelho112 commented 2 years ago

This paper (https://arxiv.org/abs/1704.00390) studies the influence of loss functions on localization problems. At the moment we are simply using an MSE, but I think we should use functions appropriate for 6DoF poses.

miguelriemoliveira commented 2 years ago

I would recommend slerp for measuring the rotation distance


DanielCoelho112 commented 2 years ago

Hi @miguelriemoliveira and @pmdjdias,

From the paper mentioned above, I realized that using the MSE the way we were using it makes no sense. So far, nothing new.

There are 3 ways to build a loss function that integrates the position and the orientation error. First, let's define them individually:

Lx --> Position Loss
Lq --> Orientation Loss

From the paper, with x̂ and q̂ the predicted position and quaternion:

Lx = ||x - x̂||

Lq = ||q - q̂/||q̂||||

1. Hardcoded weighting of position and orientation

Here, Beta (B) is a hyperparameter that must be defined. Its function is to balance the weights of the position and the orientation terms:

LB = Lx + B * Lq
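A minimal sketch of how this could look in PyTorch (the names and the choice of the Euclidean norm are mine, not taken from our code):

```python
import torch

def beta_loss(pos_pred, pos_gt, q_pred, q_gt, beta=500.0):
    # Position loss Lx: Euclidean distance between predicted and true position.
    l_x = torch.norm(pos_pred - pos_gt, dim=-1).mean()
    # Normalize the raw 4-value quaternion output before computing Lq.
    q_pred = q_pred / torch.norm(q_pred, dim=-1, keepdim=True)
    # Orientation loss Lq.
    l_q = torch.norm(q_pred - q_gt, dim=-1).mean()
    # Hardcoded weighting: LB = Lx + B * Lq (the paper reports B of 200-700 indoors).
    return l_x + beta * l_q
```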

2. Learning an optimal weighting

This case is similar to 1, but the two weighting parameters are learned by the network, so we don't have any additional hyperparameters to tune:

L_sigma = Lx * exp(-s_x) + s_x + Lq * exp(-s_q) + s_q

where s_x and s_q are learned jointly with the network weights.
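A possible sketch as a PyTorch module (initial values for s_x and s_q follow the paper's suggestion; they must be added to the optimizer together with the network parameters):

```python
import torch
import torch.nn as nn

class LearnedWeightingLoss(nn.Module):
    def __init__(self, s_x_init=0.0, s_q_init=-3.0):
        super().__init__()
        # s_x and s_q are learned jointly with the network weights.
        self.s_x = nn.Parameter(torch.tensor(s_x_init))
        self.s_q = nn.Parameter(torch.tensor(s_q_init))

    def forward(self, pos_pred, pos_gt, q_pred, q_gt):
        l_x = torch.norm(pos_pred - pos_gt, dim=-1).mean()
        q_pred = q_pred / torch.norm(q_pred, dim=-1, keepdim=True)
        l_q = torch.norm(q_pred - q_gt, dim=-1).mean()
        # L_sigma = Lx * exp(-s_x) + s_x + Lq * exp(-s_q) + s_q
        return (l_x * torch.exp(-self.s_x) + self.s_x +
                l_q * torch.exp(-self.s_q) + self.s_q)
```

The exp(-s) factors keep the effective weights positive, and the additive s terms act as a regularizer so the network cannot drive both weights to zero.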

3. Learning from geometric reprojection error

In this case, we project 3D points of the scene onto the image using both the ground-truth pose and the predicted pose, and then apply the following loss:

L_g = (1/|G'|) * sum over g_i in G' of ||π(x, q, g_i) - π(x̂, q̂, g_i)||

π --> function that projects a 3D point g_i onto the image using pose (x, q)

In this way, we are naturally weighting translation and rotation depending on the scene.
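A rough sketch of the idea, assuming a pinhole model and a world-to-camera (R, t) convention (in practice the predicted rotation would come from a differentiable quaternion-to-matrix conversion):

```python
import torch

def reprojection_loss(points_3d, K, R_gt, t_gt, R_pred, t_pred):
    # Project a set of 3D world points (N, 3) with a pinhole camera model.
    def project(R, t):
        cam = points_3d @ R.T + t          # world -> camera frame
        pix = cam @ K.T                    # apply intrinsics K (3x3)
        return pix[:, :2] / pix[:, 2:3]    # perspective divide -> pixels
    # Pixel distance between the same points seen under both poses.
    return torch.norm(project(R_gt, t_gt) - project(R_pred, t_pred), dim=-1).mean()
```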


I will implement 1 (it will be easy) and 2 (not so easy). Then we can compare both results. About 3, I think it is very nice: the authors said it was really good because it can adapt to different elements of the scene, whereas 1 and 2 are static in that aspect. However, to implement 3 we need both images and point clouds, so maybe if we implement #20, we can give it a try, right?

DanielCoelho112 commented 2 years ago

I would recommend slerp for measuring the rotation distance

That is to replace our current rotation distance computation, right?
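If I understood correctly, something like the geodesic angle between the two quaternions, i.e. the angle slerp would interpolate over (a sketch, with normalization included):

```python
import torch

def angular_distance(q1, q2):
    # Normalize both quaternions, take |<q1, q2>| to handle the q/-q
    # double cover, clamp for numerical safety, and recover the angle.
    q1 = q1 / torch.norm(q1, dim=-1, keepdim=True)
    q2 = q2 / torch.norm(q2, dim=-1, keepdim=True)
    dot = torch.clamp(torch.abs((q1 * q2).sum(dim=-1)), max=1.0)
    return 2.0 * torch.acos(dot)  # rotation angle in radians
```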

Another thing I want to discuss with you: why Rodrigues angles? I did some research on how to represent orientations in regression tasks, and I didn't find any paper that uses Rodrigues; they all use quaternions. What are the reasons that led you to use Rodrigues in ATOM?

miguelriemoliveira commented 2 years ago

Hi @DanielCoelho112 ,

I use Rodrigues because I think that's what they use in the OpenCV optimizations

https://docs.opencv.org/3.4/d9/d0c/group__calib3d.html#ga61585db663d9da06b68e70cfbf6a1eac

From the little I read, it does not have the problems that Euler angles have.

Never used quaternions, but those could also be an option.

Note: would the output of the network with the 4 quaternion values already be normalized? The quaternion norm should be 1 ... does the network already do that, or would we have to post-process?

miguelriemoliveira commented 2 years ago

I will implement 1 (it will be easy) and 2 (not so easy). Then we can compare both results. About 3, I think it is very nice: the authors said it was really good because it can adapt to different elements of the scene, whereas 1 and 2 are static in that aspect. However, to implement 3 we need both images and point clouds, so maybe if we implement https://github.com/DanielCoelho112/localization_end_to_end/issues/20, we can give it a try, right?

About 1.

I would say the Lx should have a (1-B) multiplying it, don't you think?

LB = (1-B) Lx + B Lq

About 2.

It is a bit strange for me that the network learns parameters of the loss function, because the loss function is how the network is evaluated. So is the network also optimizing how it is evaluated? That sounds strange ...

About 3.

Not quite sure how we go about doing this ... the only way I see it is if your error measure does not compare the pose we should have with the pose the network predicts, but rather the pixel positions of some 3D points projected into the camera (the projection depends on the pose), once using the ground-truth pose and once using the estimated camera pose.

That would be nice, I think you don't even need #20 for that...

In any case, clearly you should not start with this one ...

DanielCoelho112 commented 2 years ago

Hi @miguelriemoliveira,

I use Rodrigues because I think that's what they use in the OpenCV optimizations

Hm, good point...

Never used quaternions, but those could also be an option.

I'm going to change the code to quaternions to be in line with recent research. All the losses I've seen were tested with quaternions, so I think it is safer to use them.

Note: would the output of the network with the 4 quaternion values already be normalized? The quaternion norm should be 1 ... does the network already do that, or would we have to post-process?

The network does not do that; we have to post-process. Check the rotation loss, which normalizes the predicted quaternion before comparing it with the ground truth:

Lq = ||q - q̂/||q̂||||
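In code, the post-processing could be something like this (a minimal sketch, not necessarily how we will implement it):

```python
import torch
import torch.nn.functional as F

def rotation_loss(q_pred, q_gt):
    # Normalize the raw 4-value network output to unit norm,
    # then compare it with the ground-truth quaternion.
    q_pred = F.normalize(q_pred, p=2, dim=-1)
    return torch.norm(q_gt - q_pred, dim=-1).mean()
```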

DanielCoelho112 commented 2 years ago

About 1.

I would say the Lx should have a (1-B) multiplying it, don't you think?

LB = (1-B) Lx + B Lq

No, B is not a value between 0 and 1. The paper says that for indoor environments the best values range from 200 to 700. It just balances the weights of the two losses, because the position loss is always higher than the orientation loss.

About 2.

It is a bit strange for me that the network learns parameters of the loss function, because the loss function is how the network is evaluated. So is the network also optimizing how it is evaluated? That sounds strange ...

Yes, I thought the same thing... The network could achieve a low loss but fail to achieve a good pose estimate. I'm also skeptical about this loss, but it is what they are always using; I've found 3 papers that use it.

Not quite sure how we go about doing this ... the only way I see it is if your error measure does not compare the pose we should have with the pose the network predicts, but rather the pixel positions of some 3D points projected into the camera (the projection depends on the pose), once using the ground-truth pose and once using the estimated camera pose.

Yes, that is the idea. By measuring the distance between the points projected with the real pose and those projected with the predicted pose, we can update the weights of the network to predict poses that minimize that pixel distance. This could be a good loss for our multimodal localization to explore later.

So for now, I'll implement 1 and 2, and I'll leave 3 for future work.

miguelriemoliveira commented 2 years ago

Right, there's the normalization.

This could be a good loss for our multimodal localization to explore later.

I think you could do that with the depth camera, so it does not have to be multimodal; it can also be unimodal.

miguelriemoliveira commented 2 years ago

So for now, I'll implement 1 and 2, and I'll leave 3 for future work.

Makes sense ...

pmdjdias commented 2 years ago

Really, using the network to estimate the parameters of the loss function used to train the network sounds strange. A little bit like changing the value of the ending condition of a for or a while inside the loop... It is probable that "shit happens" :-)! But probably something to discuss face to face (or zoom to zoom)... Seems interesting anyway!

DanielCoelho112 commented 2 years ago

Hi @miguelriemoliveira and @pmdjdias,

The two loss functions are already implemented, see: https://github.com/DanielCoelho112/localization_end_to_end/blob/3d981c9bf5e0f3eceba01d7d0827637189d58711/localbot_localization/src/loss_functions.py#L1-L40

It was easier than I was expecting...

I think you could do that with the depth camera, so it does not have to be multimodal it can also be unimodal.

Yes, you're right.

But probably something to discuss face to face (or zoom to zoom)... Seems interesting anyway!

As soon as we have our pipeline fully developed, we will discuss it by looking at numbers :). Only then can we know which one is better.

Moving now to #23. We are reaching the end of the pipeline!