cbfinn / gps

Guided Policy Search
http://rll.berkeley.edu/gps/

Trajectory optimization not stable #108

Open yongxf opened 5 years ago

yongxf commented 5 years ago

Hi there,

Thanks for your excellent code. I am running it with my own MuJoCo model to do peg-in-hole insertion with algorithm_traj_opt only (no neural net yet). The first 15 iterations seem okay and the trajectory is converging. However, things suddenly get worse after that: the Laplace estimate of the improvement produces a very large value, so the new eta grows very fast, and then the program crashes with a non-positive-definite (non-PD) error.

I checked the iLQR paper, and there seems to be no Laplace estimation in it. Also, Qtt (the combination of Qxx, Qxu, Quu) has a very different form from the equation you wrote in traj_opt_lqr_python.py. The iLQR paper I read is this one: https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
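For reference, the backward-pass Q terms I mean are the ones from the Tassa et al. paper with the second-order dynamics terms dropped. This is just my own transcription with my own variable names (arguments are NumPy arrays of matching dimensions):

```python
def q_terms(lx, lu, lxx, luu, lux, fx, fu, Vx, Vxx):
    """One backward-pass step of iLQR (Tassa et al. 2012, Gauss-Newton form).
    l* is the cost expansion and f* the dynamics linearization at time t;
    Vx/Vxx are the value-function expansion at time t+1."""
    Qx  = lx + fx.T.dot(Vx)
    Qu  = lu + fu.T.dot(Vx)
    Qxx = lxx + fx.T.dot(Vxx).dot(fx)
    Quu = luu + fu.T.dot(Vxx).dot(fu)
    Qux = lux + fu.T.dot(Vxx).dot(fx)
    return Qx, Qu, Qxx, Quu, Qux
```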

Could you point me to the paper the Laplace estimation comes from, and to the iLQR paper your implementation follows? Appreciate it!
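For context on what I mean by the Laplace estimate: as far as I can tell, it is the expected cost of the linear-Gaussian controller under the quadratic cost expansion, evaluated with the Gaussian state-action marginals from the LQR forward pass, and the predicted improvement is the difference of this quantity under the previous and the new controller. A minimal sketch of that computation (variable names are mine, not the repo's):

```python
import numpy as np

def laplace_cost_estimate(mu, sigma, cc, cv, Cm):
    """Expected cost per time step of Gaussian state-action marginals
    N(mu[t], sigma[t]) under the quadratic cost expansion
    c_t(z) ~= cc[t] + cv[t]'z + 0.5 z'Cm[t] z, with z = [x; u]."""
    T = mu.shape[0]
    cost = np.zeros(T)
    for t in range(T):
        # E[c(z)] = c0 + c_v' mu + 0.5 mu' C mu + 0.5 tr(C Sigma)
        cost[t] = (cc[t]
                   + cv[t].dot(mu[t])
                   + 0.5 * mu[t].dot(Cm[t]).dot(mu[t])
                   + 0.5 * np.sum(Cm[t] * sigma[t]))
    return cost
```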

yongxf commented 5 years ago

The instability comes from the eta update in iLQR. Eta penalizes one of the KL divergence terms and is tuned by comparing kl_div with kl_step. The problem is:

1) When the Monte Carlo cost increases, new_mult < 1 and the step decreases, because the actual improvement is much smaller than the predicted improvement and the algorithm tries to reduce the step size.
2) When the step decreases, con > 0, since kl_step = step * kl_base, so the theoretical bound becomes stricter. (You refer to kl_step in the code as epsilon, which is not correct, since epsilon controls the other KL divergence term.)
3) When con > 0, eta increases, since the stricter constraint makes the current KL divergence violate it, so more penalty is added (i.e., eta increases).

In summary: when the actual cost increases, the penalization of the KL divergence increases; the chain is sketched in code below.
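Roughly, in code terms (a paraphrase of my reading of the step/eta logic, not the exact repo code; the clamping bounds and the eta update are stand-ins):

```python
def adjust_step_and_eta(predicted_impr, actual_impr, step_mult, kl_base, kl_div, eta):
    """Paraphrased sketch of the step/eta interaction described above."""
    # 1) Cost went up => actual_impr << predicted_impr => new_mult < 1 => step shrinks.
    new_mult = predicted_impr / (2.0 * max(1e-4 * predicted_impr,
                                           predicted_impr - actual_impr))
    step_mult = max(0.1, min(10.0, new_mult * step_mult))  # stand-in clamp bounds

    # 2) Smaller step_mult => smaller KL bound at the next iteration.
    kl_step = step_mult * kl_base

    # 3) Tighter bound => constraint violated (con > 0) => eta is pushed up,
    #    i.e. the KL term is penalized harder exactly when the cost got worse.
    con = kl_div - kl_step
    if con > 0:
        eta *= 2.0  # stand-in for the bracketing search on eta
    return step_mult, kl_step, eta
```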

This is not reasonable, since putting more weight on the KL divergence term makes the cost term even larger. After several iterations the robot waves around crazily.

The first several iterations are normal, though. I guess the scaling of the improvement in the new_mult calculation matters.
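As a rough illustration with my own numbers: if the Laplace estimate predicts an improvement of 100 but the sampled cost actually got worse by 20, then new_mult = 100 / (2 * (100 + 20)) ≈ 0.42, so step_mult (and with it kl_step) is cut by more than half, even though the real issue may be that the quadratic/Gaussian model is simply inaccurate rather than the KL bound being too loose.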

Any comment on this?