@HannesSommer @pfmark
Okay, let's finish this thing :).
The core piece is x -> 1/(1 + exp(x)). After we have that function properly implemented together with its Jacobian, it is easy to get the logistic function right, because that only involves chaining with affine linear transformations. To do that with minimal hassle I would keep the interface we have and just add two functions, oneOverOnePlusExp and oneOverOnePlusExpJacobian, test them separately until they are as good as we need them to be, and use them later in the implementation of the ScalarExpressionNodeInverseSigmoid. This will remove all the extra threshold complication we have when considering the additional parameters of the logistic function.

Very good ideas. tanh looks suitable indeed, atan too. We should try both. I agree that solving the "flipped" sigmoid problem is probably a waste of time. I'll give it a try.
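To make the oneOverOnePlusExp idea concrete, here is a minimal sketch (not the actual implementation, just one way to make the evaluation overflow-safe; names follow the proposal above):

```cpp
#include <cmath>

// Numerically stable evaluation of f(x) = 1 / (1 + exp(x)).
// For large positive x, exp(x) overflows, so we rewrite the expression as
// exp(-x) / (1 + exp(-x)), which only evaluates exp of a non-positive number.
double oneOverOnePlusExp(double x) {
  if (x >= 0.0) {
    const double e = std::exp(-x);   // in (0, 1], cannot overflow
    return e / (1.0 + e);
  }
  return 1.0 / (1.0 + std::exp(x));  // exp(x) <= 1 here, also safe
}

// Jacobian (derivative w.r.t. x): f'(x) = -f(x) * (1 - f(x)).
// Expressing it through f itself avoids evaluating exp(x) again and
// decays gracefully to 0 in both tails instead of overflowing.
double oneOverOnePlusExpJacobian(double x) {
  const double f = oneOverOnePlusExp(x);
  return -f * (1.0 - f);
}
```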
Maybe I wasn't clear about tanh: it is really basically the same function. It is only an affine linear transformation of the original logistic function, so it won't help. Of course it is still nice to have it in the list of AD-supported functions, but for the planner it won't help. I only mentioned tanh as an idea to look for in case somebody has solved the numerical problems for tanh already, because then one could reuse those solutions for our logistic function.
atan is a completely different story and might very well be nicer numerically and maybe even better suited.
Ah, wait, there is a std::tanh implementation already? Then that could be a good implementation already, and it could actually help. Have you seen the relation to the logistic function (see https://en.wikipedia.org/wiki/Logistic_function)? You can use it as a drop-in replacement because 1 / (1 + exp(-x)) = 1/2 * (1 + tanh(x / 2)).
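In code the drop-in form would look roughly like this (just a sketch using std::tanh; the scale/shift/height parameters of the full logistic node are left out):

```cpp
#include <cmath>

// Logistic function written via std::tanh, using the identity
// 1 / (1 + exp(-x)) = 0.5 * (1 + tanh(x / 2)).
// std::tanh saturates cleanly at +/-1, so this never overflows.
double logisticViaTanh(double x) {
  return 0.5 * (1.0 + std::tanh(0.5 * x));
}

// Its derivative, 0.25 * (1 - tanh(x/2)^2), which also decays to zero
// in both tails without any intermediate overflow.
double logisticViaTanhDerivative(double x) {
  const double t = std::tanh(0.5 * x);
  return 0.25 * (1.0 - t * t);
}
```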
That makes sense, I'll have to try it. The atan version works okay for planning, but I'm not sure we want the increased slope for larger x. The slope is not enough to push something out of obstacles, but it makes the avoidance distance quite sensitive to the parameters of the atan function. It might make more sense to have a steep slope and then switch to linear.
But far away the slope is basically the same (the difference stays smaller than one for the unscaled x!). Only shortly after the center point is atan a bit steeper. Isn't that what we want?
See http://www.wolframalpha.com/input/?i=atan%28x%29+-+tanh%28x%29 for atan - tanh and http://www.wolframalpha.com/input/?i=1%2F%281%2B+x^2%29+-+1%2Fcosh^2%28x%29 for their derivatives (atan' - tanh').
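For a quick numerical check of that comparison (illustrative only; it prints atan - tanh, which stays bounded below one, and the two derivatives 1/(1+x^2) and 1/cosh(x)^2):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // atan(x) - tanh(x) tends to pi/2 - 1 ~= 0.571, i.e. stays below 1,
  // while both derivatives go to zero, atan' decaying much more slowly.
  const double xs[] = {0.5, 1.0, 2.0, 5.0, 10.0};
  for (double x : xs) {
    const double diff = std::atan(x) - std::tanh(x);
    const double datan = 1.0 / (1.0 + x * x);
    const double dtanh = 1.0 / (std::cosh(x) * std::cosh(x));
    std::printf("x=%5.1f  atan-tanh=%.4f  atan'=%.4e  tanh'=%.4e\n",
                x, diff, datan, dtanh);
  }
  return 0;
}
```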
What I meant is that for a reasonable gradient inside obstacles the shift parameter must not be too large. That implies that the function has a significant impact on the cost function also outside of (but close to) obstacles. We can hardly change the "influence region", since modifying the scale parameter affects the ability to push something out of obstacles. Of course this is the same for the sigmoid; it is just an argument to look for something non-symmetric.
For testing one could introduce a pre-scale factor for the negative part of the signed distance field. If that factor is greater than one, the trajectory would get pushed harder out of the obstacles without changing the slope outside of them.
Oh interesting, that's a cool idea. You mean downscale the negative distances so they get closer to the "step"?
That's a bit difficult to implement in the distance field, since we use the shift parameter to incorporate the agent radii. So for an agent with center point outside the obstacle but boundary inside, the distance from the grid is positive.
Ah, good point, I forgot about that. It is in any case equivalent to applying a piecewise factor (1 for x > 0 and a larger one for x < 0) to the distance. I was thinking of the distance map mostly for performance. But I also forgot that we need continuous derivatives, don't we? Maybe one should in fact apply a differentiable function with a similar graph. Or just go for that linear, differentiable continuation of the positive half of the sigmoid to the negative side, as you suggested.
Or we use a multiplicative function consisting of three polynomials: linear for x < -eps, cubic for -eps < x < 0 so that values and first derivatives match on both sides, and constant 1 for x > 0. Basically the same as Hermite spline interpolation, to get the desired function 1 - x for negative x and 1 otherwise. A sketch of this is below.
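Something like this (just a sketch; eps and the function name are placeholders):

```cpp
#include <cmath>

// Piecewise multiplier as described above: 1 for x >= 0, the linear function
// 1 - x for x <= -eps, and a cubic Hermite segment on [-eps, 0] that matches
// values and first derivatives on both sides, so the factor is C^1 continuous.
double obstaclePrescale(double x, double eps = 0.1) {
  if (x >= 0.0) return 1.0;
  if (x <= -eps) return 1.0 - x;

  // Hermite data on [a, b] = [-eps, 0]:
  // p(a) = 1 + eps, p'(a) = -1 (from 1 - x);  p(b) = 1, p'(b) = 0 (constant 1).
  const double a = -eps, b = 0.0, h = b - a;
  const double t = (x - a) / h;                 // t in [0, 1]
  const double h00 = 2*t*t*t - 3*t*t + 1;       // standard cubic Hermite basis
  const double h10 = t*t*t - 2*t*t + t;
  const double h01 = -2*t*t*t + 3*t*t;
  const double h11 = t*t*t - t*t;
  return h00 * (1.0 + eps) + h10 * h * (-1.0) + h01 * 1.0 + h11 * h * 0.0;
}
```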
It seems reasonable to continue any function having the desired properties on the positive abscissa (like a steep slope at zero and going to zero as x -> inf) with a quadratic function on the negative abscissa. This way, optimizers which locally use a quadratic approximation should get out of obstacles very quickly.

Yet I observed a very large number of iterations with the current BFGS implementation. The reason AFAIK is the following: in the first step of BFGS the approximate Hessian is the identity, so the first step is steepest descent (with line search). Since we have a very large gradient inside obstacles, the line search returns a very small step size, which is OK. The update of the approximate Hessian should then actually return the true Hessian for a quadratic function (todo: check that this is true). Now if we used a step size of one and had a pure quadratic function, the next step should bring us to the minimum. Yet in our implementation we use the (very small) step size of the first iteration to seed the line search in the second iteration, so we make very small progress in the following iterations. That seems wrong and raises the question of how to set the initial step length.
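The seeding logic I mean would roughly be the following (hypothetical names, not our actual BFGS code):

```cpp
// Sketch of the line-search seeding discussed above. The first iteration is
// effectively steepest descent, so a tiny accepted step is fine there, but
// once the BFGS Hessian approximation carries curvature information the
// natural trial step is 1.0. Reusing the previous (tiny) accepted step length
// as the seed throws that information away.
double initialTrialStepLength(int iteration, double previousStepLength) {
  if (iteration == 0) {
    // No curvature information yet: use a conservative caller-supplied guess
    // (e.g. scaled by the gradient norm).
    return previousStepLength;
  }
  // Quasi-Newton direction available: always try the full step first and let
  // the line search backtrack only if needed.
  return 1.0;
}
```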
@helenol hey Helen, do you have something in your framework that can reliably push spline trajectories out of obstacles during gradient based optimization in case the initial guess is in collision? We struggle to find something that works if the trajectory is really going completely through a bigger obstacle.
Reliably is a strong word... It's kind of initialization dependent (if it's 0 gradient at initialization, it really struggles obviously, but otherwise works), but for the most part, yes. The discrete-time approach is CHOMP: http://repository.cmu.edu/cgi/viewcontent.cgi?article=1040&context=robotics Then my paper is the continuous-time version of that based on high-degree polynomial splines: http://helenol.github.io/publications/iros_2016_replanning.pdf
The code is a bit rough at the moment, but it's based on our polynomial trajectory framework: https://github.com/ethz-asl/mav_planning_utils
Let me know if you want more details/hook you up with the code.
This was fixed, wasn't it, @uschwes?
At least the numerical issues, yes.
Maybe we should close it here and move whatever is unsolved to new issues wherever appropriate?
I think the days of the MaxEnt framework are numbered, let's just close.
There are currently three issues with the ScalarExpressionNodeInverseSigmoid functionality:
IMHO we should compute the error in the gradient method and check for overflow; if it overflows, we return zero. In addition we have to compute a suitable threshold for the Taylor series expansion based on the scale, shift and height values and use this threshold in both the error and the gradient computation. This is unfortunately not trivial, since we need a function that computes in advance the maximum x value, dependent on height, scale and shift, at which the gradient computation will not overflow.
I believe these problems make the BFGS optimizer fail from time to time.
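For the overflow check, something along these lines could work (illustrative sketch only; the exact parameterisation of ScalarExpressionNodeInverseSigmoid may differ, here assumed to be f(x) = height / (1 + exp(scale * (x - shift)))):

```cpp
#include <cmath>
#include <limits>

// Overflow-guarded gradient of f(x) = height / (1 + exp(scale * (x - shift))).
// The argument of exp() is checked against the largest value that still yields
// a finite double; past that point the true gradient is numerically zero
// anyway, so returning 0 avoids the inf/inf = NaN case.
double inverseSigmoidGradient(double x, double height, double scale, double shift) {
  const double arg = scale * (x - shift);
  // log(DBL_MAX) ~= 709.78; beyond this exp(arg) overflows to inf.
  static const double kMaxExpArg = std::log(std::numeric_limits<double>::max());
  if (arg > kMaxExpArg) {
    return 0.0;  // gradient has underflown to zero long before exp overflows
  }
  const double e = std::exp(arg);
  const double denom = 1.0 + e;
  return -height * scale * e / (denom * denom);
}
```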