cbfinn / gps

Guided Policy Search
http://rll.berkeley.edu/gps/
Other
593 stars 239 forks source link

PILQR #80

Closed zhangmarvin closed 7 years ago

zhangmarvin commented 7 years ago

This PR implements the PILQR algorithm as described in [1] as well as the MDGPS extension. Three experiments are provided: a pointmass example with discontinuous costs, as well as the gripper pusher trajectory optimization and door opening policy optimization experiments from [1].

The main contributions are as follows:

There are various other minor changes that should be relatively self-explanatory given the diffs below, mostly cosmetic and some functional.

Huge thanks to collaborators @Lolu28 and @ychebotar for helping to get this out the door.

[1] Y. Chebotar, et. al. "Combining model-based and model-free updates for trajectory-centric reinforcement learning." arXiv pre-print arXiv:1703.03078.

cbfinn commented 7 years ago

Thanks @zhangmarvin ! I'll probably get to this review this weekend.

zhangmarvin commented 7 years ago

Thanks Chelsea! I'll get to these comments tonight or tomorrow and revise accordingly. Is it good to merge afterwards, or would you (or someone else) want another pass?

Marvin

On Mar 28, 2017 2:48 PM, "Chelsea Finn" notifications@github.com wrote:

@cbfinn requested changes on this pull request.

Here are comments, mostly minor. The code is very clean - thank you!

I think it would be good to have some sort of README that is associated with this method, and explains how to reproduce your simulated experiments in the paper.

(e.g. see the readme in the SSRL branch.)

In python/gps/algorithm/algorithm_mdgps_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108499710:

  • for m in range(self.M):
  • AlgorithmTrajOptPILQR._stepadjust(self, m)
  • self._update_trajectories()
  • S-step

  • self._update_policy()
  • Prepare for next iteration

  • self._advance_iteration_variables()
  • def compute_costs_constrained(self, traj_distr, m, eta, augment=True):
  • """ Compute cost estimates used in the LQR backward pass. """
  • traj_info, traj_distr = self.cur[m].traj_info, traj_distr
  • if not augment:

maybe add a comment here for what this means

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108500672:

  • m: condition
  • pdpc: previous dynamics-previous controller cost
  • pdcc: previous dynamics-current controller cost
  • cdcc: current dynamics-current controller cost
  • pmc: previous actual cost
  • cmc: current actual cost
  • cur_res:
  • """
  • def _stepadjust(self, m):
  • """
  • Calculate new step sizes.
  • Args:
  • m: Condition
  • """
  • pdpc = self.traj_opt.estimate_cost(

For readability, maybe add comment like "# previous dynamics, previous controller" here

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108500841:

  • self.cur[m].step_mult = self.prev[m].step_mult
  • elif self._hyperparams['step_rule'] == 'res_percent':
  • cur_res_sum = np.mean(np.sum(np.abs(cur_res), axis=1))
  • act_sum = np.mean(np.sum(np.abs(self.cur[m].cs), axis=1))
  • Print cost details.

  • LOGGER.debug(
  • 'Cur residuals: %f. Residuals percentage: %f',
  • cur_res_sum, cur_res_sum/act_sum*100
  • )
  • if self.traj_opt.cons_per_step:
  • T = cur_res.shape[1]
  • new_mult = np.ones(T)
  • for t in range(T):
  • res = np.mean(np.sum(np.abs(cur_res[:, t:]), axis=1))
  • act = np.mean(np.sum(np.abs(self.cur[m].cs[:, t:]), axis=1))
  • if res/act > self._hyperparams['step_rule_res_ratio_dec']:

nitpic: there should be spaces around the "/"

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108501232:

  • K[t, :, :].T.dot(ipc[t, :, :]).dot(K[t, :, :]),
  • -K[t, :, :].T.dot(ipc[t, :, :])
  • ]),
  • np.hstack([
  • -ipc[t, :, :].dot(K[t, :, :]), ipc[t, :, :]
  • ])
  • ])
  • fcv[t, :] += eta / (eta + multiplier) * np.hstack([
  • K[t, :, :].T.dot(ipc[t, :, :]).dot(k[t, :]),
  • -ipc[t, :, :].dot(k[t, :])
  • ])
  • return fCm, fcv
  • def compute_costs(self, m, eta, augment=True):
  • return self._compute_costs(self.cur[m].traj_distr, m, eta, augment)

Why not just have a single function that sets traj_distr = self.cur[m].traj_distr at the top?

In python/gps/algorithm/cost/cost_fk_blocktouch.py https://github.com/cbfinn/gps/pull/80#discussion_r108501864:

@@ -0,0 +1,56 @@ +# TODO: This cost function is hard-coded and not easy to generalize.

Is this true? If so, can you put inline comments in parts where it is hard-coded? (in case someone wants to use it for a different scenario)

In python/gps/algorithm/cost/cost_fk_blocktouch.py https://github.com/cbfinn/gps/pull/80#discussion_r108501953:

@@ -0,0 +1,56 @@ +# TODO: This cost function is hard-coded and not easy to generalize. +import copy + +import numpy as np + +from gps.algorithm.cost.config import COST_FK +from gps.algorithm.cost.cost import Cost +from gps.algorithm.cost.cost_utils import get_ramp_multiplier +from gps.proto.gps_pb2 import JOINT_ANGLES, END_EFFECTOR_POINTS, \

  • END_EFFECTOR_POINT_JACOBIANS
  • +class CostFKBlock(Cost):

Can you add a brief description of this cost function? (like the binary cost description)

In python/gps/algorithm/policy_opt/tf_model_example.py https://github.com/cbfinn/gps/pull/80#discussion_r108502377:

@@ -90,6 +90,29 @@ def example_tf_network(dim_input=27, dim_output=7, batch_size=25, network_config return TfMap.init_from_lists([nn_input, action, precision], [mlp_applied], [loss_out]), fc_vars, []

+def tf_network(dim_input=27, dim_output=7, batch_size=25, network_config=None):

Can we just replace example_tf_network with this?

In python/gps/algorithm/traj_opt/config.py https://github.com/cbfinn/gps/pull/80#discussion_r108502519:

@@ -6,13 +6,28 @@

Dual variable updates for non-PD Q-function.

'del0': 1e-4, 'eta_error_threshold': 1e16,

  • 'min_eta': 1e-4,
  • 'min_eta': 1e-8,

How does this change affect other experiments?

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502751:

@@ -39,7 +41,8 @@ def init(self, hyperparams): self._covariance_damping = self._hyperparams['covariance_damping'] self._min_temperature = self._hyperparams['min_temperature']

  • def update(self, m, algorithm):
  • def update(self, m, algorithm, use_lqr_actions=False,
  • fixed_eta=None, use_fixed_eta=False, costs=None):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502840:

  • def update_pi2(self, samples, costs, mean_old, cov_old):
  • def update_pi2(self, samples, costs, mean_old, cov_old,
  • fixed_eta=None, use_fixed_eta=False):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502937:

  • cov_new[t])) / len(cov_old[t])
  • mult = np.power(mult, 1 / self._covariance_damping)
  • cov_new[t] = mult * cov_old[t]
  • Compute covariance inverse and cholesky decomposition.

  • inv_cov_new[t] = sp.linalg.inv(cov_new[t])
  • chol_cov_new[t] = sp.linalg.cholesky(cov_new[t])
  • return mean_new, cov_new, inv_cov_new, chol_cov_new
  • etas = np.zeros(T)
  • del_ = self._hyperparams['del0']
  • if self._hyperparams['pi2_cons_per_step']:
  • del = np.ones(T) * del
  • fail = True

maybe add comment for what fail is.

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503207:

-def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr): + +def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr, tot=True):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503308:

@@ -83,4 +87,84 @@ def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr): )

 # Add up divergences across time to get total divergence.
  • return np.sum(kl_div)
  • return np.sum(kl_div) if tot else kl_div
  • +def traj_distr_kl_alt(new_mu, new_sigma, new_traj_distr, prev_traj_distr, tot=True):

add description of what this is, and ideally also a docstring

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503364:

  • mu, sigma = new_mu[t, :dX], new_sigma[t, :dX, :dX]
  • kl_div[t] = max(
  • 0,
  • 0.5 * (logdet_prev - logdet_new - new_traj_distr.dU +
  • np.sum(np.diag(inv_prev.dot(sig_new))) +
  • k_diff.T.dot(inv_prev).dot(k_diff) +
  • mu.T.dot(K_diff.T).dot(inv_prev).dot(K_diff).dot(mu) +
  • np.sum(np.diag(K_diff.T.dot(inv_prev).dot(K_diff).dot(sigma))) +
  • 2 * k_diff.T.dot(inv_prev).dot(K_diff).dot(mu))
  • )
  • return np.sum(kl_div) if tot else kl_div
  • +def approximated_cost(sample_list, traj_distr, traj_info):

add description and docstring

In python/gps/algorithm/policy_opt/tf_model_example.py https://github.com/cbfinn/gps/pull/80#discussion_r108504271:

@@ -90,6 +90,29 @@ def example_tf_network(dim_input=27, dim_output=7, batch_size=25, network_config return TfMap.init_from_lists([nn_input, action, precision], [mlp_applied], [loss_out]), fc_vars, []

+def tf_network(dim_input=27, dim_output=7, batch_size=25, network_config=None):

  • """
  • An example of how one might want to specify a network in tensorflow.
  • Args:
  • dim_input: Dimensionality of input.
  • dim_output: Dimensionality of the output.
  • batch_size: Batch size.

update docstring

In python/gps/algorithm/traj_opt/traj_opt_lqr_python.py https://github.com/cbfinn/gps/pull/80#discussion_r108505181:

@@ -189,7 +284,10 @@ def backward(self, prev_traj_distr, traj_info, eta, algorithm, m): dU = prev_traj_distr.dU dX = prev_traj_distr.dX

  • traj_distr = prev_traj_distr.nans_like()
  • if self._update_in_bwd_pass:

This PR adds quite a lot of code to the backward function. Would it make sense to make it more modular?

In python/gps/algorithm/traj_opt/traj_opt_lqr_python.py https://github.com/cbfinn/gps/pull/80#discussion_r108505420:

                 if np.any(np.isnan(Fm)) or np.any(np.isnan(fv)):

raise ValueError('NaNs encountered in dynamics!') raise ValueError('Failed to find PD solution even for very \ large eta (check that dynamics and cost are \ reasonably well conditioned)!') return traj_distr, eta +

  • def _conv_check(self, con, kl_step):

maybe add brief description of what this is.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cbfinn/gps/pull/80#pullrequestreview-29534663, or mute the thread https://github.com/notifications/unsubscribe-auth/ACS5_1KktYEVa8BjiEV0tHe4sGUgdiBjks5rqVX3gaJpZM4MjVyF .

cbfinn commented 7 years ago

I'll take another quick pass.

On Tue, Mar 28, 2017 at 2:44 PM, Marvin Zhang notifications@github.com wrote:

Thanks Chelsea! I'll get to these comments tonight or tomorrow and revise accordingly. Is it good to merge afterwards, or would you (or someone else) want another pass?

Marvin

On Mar 28, 2017 2:48 PM, "Chelsea Finn" notifications@github.com wrote:

@cbfinn requested changes on this pull request.

Here are comments, mostly minor. The code is very clean - thank you!

I think it would be good to have some sort of README that is associated with this method, and explains how to reproduce your simulated experiments in the paper.

(e.g. see the readme in the SSRL branch.)

In python/gps/algorithm/algorithm_mdgps_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108499710:

  • for m in range(self.M):
  • AlgorithmTrajOptPILQR._stepadjust(self, m)
  • self._update_trajectories()
  • S-step

  • self._update_policy()
  • Prepare for next iteration

  • self._advance_iteration_variables()
  • def compute_costs_constrained(self, traj_distr, m, eta, augment=True):
  • """ Compute cost estimates used in the LQR backward pass. """
  • traj_info, traj_distr = self.cur[m].traj_info, traj_distr
  • if not augment:

maybe add a comment here for what this means

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108500672:

  • m: condition
  • pdpc: previous dynamics-previous controller cost
  • pdcc: previous dynamics-current controller cost
  • cdcc: current dynamics-current controller cost
  • pmc: previous actual cost
  • cmc: current actual cost
  • cur_res:
  • """
  • def _stepadjust(self, m):
  • """
  • Calculate new step sizes.
  • Args:
  • m: Condition
  • """
  • pdpc = self.traj_opt.estimate_cost(

For readability, maybe add comment like "# previous dynamics, previous controller" here

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108500841:

  • self.cur[m].step_mult = self.prev[m].step_mult
  • elif self._hyperparams['step_rule'] == 'res_percent':
  • cur_res_sum = np.mean(np.sum(np.abs(cur_res), axis=1))
  • act_sum = np.mean(np.sum(np.abs(self.cur[m].cs), axis=1))
  • Print cost details.

  • LOGGER.debug(
  • 'Cur residuals: %f. Residuals percentage: %f',
  • cur_res_sum, cur_res_sum/act_sum*100
  • )
  • if self.traj_opt.cons_per_step:
  • T = cur_res.shape[1]
  • new_mult = np.ones(T)
  • for t in range(T):
  • res = np.mean(np.sum(np.abs(cur_res[:, t:]), axis=1))
  • act = np.mean(np.sum(np.abs(self.cur[m].cs[:, t:]), axis=1))
  • if res/act > self._hyperparams['step_rule_res_ratio_dec']:

nitpic: there should be spaces around the "/"

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108501232:

  • K[t, :, :].T.dot(ipc[t, :, :]).dot(K[t, :, :]),
  • -K[t, :, :].T.dot(ipc[t, :, :])
  • ]),
  • np.hstack([
  • -ipc[t, :, :].dot(K[t, :, :]), ipc[t, :, :]
  • ])
  • ])
  • fcv[t, :] += eta / (eta + multiplier) * np.hstack([
  • K[t, :, :].T.dot(ipc[t, :, :]).dot(k[t, :]),
  • -ipc[t, :, :].dot(k[t, :])
  • ])
  • return fCm, fcv
  • def compute_costs(self, m, eta, augment=True):
  • return self._compute_costs(self.cur[m].traj_distr, m, eta, augment)

Why not just have a single function that sets traj_distr = self.cur[m].traj_distr at the top?

In python/gps/algorithm/cost/cost_fk_blocktouch.py https://github.com/cbfinn/gps/pull/80#discussion_r108501864:

@@ -0,0 +1,56 @@ +# TODO: This cost function is hard-coded and not easy to generalize.

Is this true? If so, can you put inline comments in parts where it is hard-coded? (in case someone wants to use it for a different scenario)

In python/gps/algorithm/cost/cost_fk_blocktouch.py https://github.com/cbfinn/gps/pull/80#discussion_r108501953:

@@ -0,0 +1,56 @@ +# TODO: This cost function is hard-coded and not easy to generalize. +import copy + +import numpy as np + +from gps.algorithm.cost.config import COST_FK +from gps.algorithm.cost.cost import Cost +from gps.algorithm.cost.cost_utils import get_ramp_multiplier +from gps.proto.gps_pb2 import JOINT_ANGLES, END_EFFECTOR_POINTS, \

  • END_EFFECTOR_POINT_JACOBIANS
  • +class CostFKBlock(Cost):

Can you add a brief description of this cost function? (like the binary cost description)

In python/gps/algorithm/policy_opt/tf_model_example.py https://github.com/cbfinn/gps/pull/80#discussion_r108502377:

@@ -90,6 +90,29 @@ def example_tf_network(dim_input=27, dim_output=7, batch_size=25, network_config return TfMap.init_from_lists([nn_input, action, precision], [mlp_applied], [loss_out]), fc_vars, []

+def tf_network(dim_input=27, dim_output=7, batch_size=25, network_config=None):

Can we just replace example_tf_network with this?

In python/gps/algorithm/traj_opt/config.py https://github.com/cbfinn/gps/pull/80#discussion_r108502519:

@@ -6,13 +6,28 @@

Dual variable updates for non-PD Q-function.

'del0': 1e-4, 'eta_error_threshold': 1e16,

  • 'min_eta': 1e-4,
  • 'min_eta': 1e-8,

How does this change affect other experiments?

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502751:

@@ -39,7 +41,8 @@ def init(self, hyperparams): self._covariance_damping = self._hyperparams['covariance_damping'] self._min_temperature = self._hyperparams['min_temperature']

  • def update(self, m, algorithm):
  • def update(self, m, algorithm, use_lqr_actions=False,
  • fixed_eta=None, use_fixed_eta=False, costs=None):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502840:

  • def update_pi2(self, samples, costs, mean_old, cov_old):
  • def update_pi2(self, samples, costs, mean_old, cov_old,
  • fixed_eta=None, use_fixed_eta=False):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502937:

  • cov_new[t])) / len(cov_old[t])
  • mult = np.power(mult, 1 / self._covariance_damping)
  • cov_new[t] = mult * cov_old[t]
  • Compute covariance inverse and cholesky decomposition.

  • inv_cov_new[t] = sp.linalg.inv(cov_new[t])
  • chol_cov_new[t] = sp.linalg.cholesky(cov_new[t])
  • return mean_new, cov_new, inv_cov_new, chol_cov_new
  • etas = np.zeros(T)
  • del_ = self._hyperparams['del0']
  • if self._hyperparams['pi2_cons_per_step']:
  • del = np.ones(T) * del
  • fail = True

maybe add comment for what fail is.

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503207:

-def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr): + +def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr, tot=True):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503308:

@@ -83,4 +87,84 @@ def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr): )

Add up divergences across time to get total divergence.

  • return np.sum(kl_div)
  • return np.sum(kl_div) if tot else kl_div
  • +def traj_distr_kl_alt(new_mu, new_sigma, new_traj_distr, prev_traj_distr, tot=True):

add description of what this is, and ideally also a docstring

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503364:

  • mu, sigma = new_mu[t, :dX], new_sigma[t, :dX, :dX]
  • kl_div[t] = max(
  • 0,
  • 0.5 * (logdet_prev - logdet_new - new_traj_distr.dU +
  • np.sum(np.diag(inv_prev.dot(sig_new))) +
  • k_diff.T.dot(inv_prev).dot(k_diff) +
  • mu.T.dot(K_diff.T).dot(inv_prev).dot(K_diff).dot(mu) +
  • np.sum(np.diag(K_diff.T.dot(inv_prev).dot(K_diff).dot(sigma))) +
  • 2 * k_diff.T.dot(inv_prev).dot(K_diff).dot(mu))
  • )
  • return np.sum(kl_div) if tot else kl_div
  • +def approximated_cost(sample_list, traj_distr, traj_info):

add description and docstring

In python/gps/algorithm/policy_opt/tf_model_example.py https://github.com/cbfinn/gps/pull/80#discussion_r108504271:

@@ -90,6 +90,29 @@ def example_tf_network(dim_input=27, dim_output=7, batch_size=25, network_config return TfMap.init_from_lists([nn_input, action, precision], [mlp_applied], [loss_out]), fc_vars, []

+def tf_network(dim_input=27, dim_output=7, batch_size=25, network_config=None):

  • """
  • An example of how one might want to specify a network in tensorflow.
  • Args:
  • dim_input: Dimensionality of input.
  • dim_output: Dimensionality of the output.
  • batch_size: Batch size.

update docstring

In python/gps/algorithm/traj_opt/traj_opt_lqr_python.py https://github.com/cbfinn/gps/pull/80#discussion_r108505181:

@@ -189,7 +284,10 @@ def backward(self, prev_traj_distr, traj_info, eta, algorithm, m): dU = prev_traj_distr.dU dX = prev_traj_distr.dX

  • traj_distr = prev_traj_distr.nans_like()
  • if self._update_in_bwd_pass:

This PR adds quite a lot of code to the backward function. Would it make sense to make it more modular?

In python/gps/algorithm/traj_opt/traj_opt_lqr_python.py https://github.com/cbfinn/gps/pull/80#discussion_r108505420:

if np.any(np.isnan(Fm)) or np.any(np.isnan(fv)): raise ValueError('NaNs encountered in dynamics!') raise ValueError('Failed to find PD solution even for very \ large eta (check that dynamics and cost are \ reasonably well conditioned)!') return traj_distr, eta +

  • def _conv_check(self, con, kl_step):

maybe add brief description of what this is.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cbfinn/gps/pull/80#pullrequestreview-29534663, or mute the thread https://github.com/notifications/unsubscribe-auth/ACS5_ 1KktYEVa8BjiEV0tHe4sGUgdiBjks5rqVX3gaJpZM4MjVyF .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/cbfinn/gps/pull/80#issuecomment-289914196, or mute the thread https://github.com/notifications/unsubscribe-auth/ABMlARwXSIOFP2PIBHHRw4iuAvqLLRdaks5rqX8igaJpZM4MjVyF .

zhangmarvin commented 7 years ago

Ready for you to take another look, @cbfinn. Regarding the README: I feel that's a bit unnecessary, since reproducing the results in the paper is as simple as running the mjc_gripper_pusher and mjc_door_opening experiments, i.e., unlike SSRL, there's nothing complicated or unusual compared to normal GPS.

zhangmarvin commented 7 years ago

I've added 'pilqr' to the experiment names and done the remaining cleanup.

cbfinn commented 7 years ago

LGTM!

AswinRetnakumar commented 4 years ago

Hey is there a way to integrate pilqr algorithm with pr2 robot for moving the robot hand the a point by changing the hyperparameter values.