PILQR - Githubissues

zhangmarvin commented 7 years ago

This PR implements the PILQR algorithm as described in [1] as well as the MDGPS extension. Three experiments are provided: a pointmass example with discontinuous costs, as well as the gripper pusher trajectory optimization and door opening policy optimization experiments from [1].

The main contributions are as follows:

Algorithm files algorithm_mdgps_pilqr.py, algorithm_traj_opt_pilqr.py, and traj_opt_pilqr.py. In addition, traj_opt_lqr_python.py and traj_opt_pi2.py have been modified heavily to support the algorithm detailed in [1].
Experiment files provided in the mjc_disc_cost_example, mjc_gripper_pusher, and mjc_door_opening experiment directories along with the MuJoCo XML files.
Supporting files for new cost functions (note that there is one TODO left here that I believe is out of the scope of this PR), TensorFlow network construction, and KL-divergence computation.

There are various other minor changes that should be relatively self-explanatory given the diffs below, mostly cosmetic and some functional.

Huge thanks to collaborators @Lolu28 and @ychebotar for helping to get this out the door.

[1] Y. Chebotar, et. al. "Combining model-based and model-free updates for trajectory-centric reinforcement learning." arXiv pre-print arXiv:1703.03078.

cbfinn commented 7 years ago

Thanks @zhangmarvin ! I'll probably get to this review this weekend.

zhangmarvin commented 7 years ago

Thanks Chelsea! I'll get to these comments tonight or tomorrow and revise accordingly. Is it good to merge afterwards, or would you (or someone else) want another pass?

Marvin

On Mar 28, 2017 2:48 PM, "Chelsea Finn" notifications@github.com wrote:

@cbfinn requested changes on this pull request.

Here are comments, mostly minor. The code is very clean - thank you!

I think it would be good to have some sort of README that is associated with this method, and explains how to reproduce your simulated experiments in the paper.

(e.g. see the readme in the SSRL branch.)

In python/gps/algorithm/algorithm_mdgps_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108499710:

for m in range(self.M):

AlgorithmTrajOptPILQR._stepadjust(self, m)

self._update_trajectories()

S-step

self._update_policy()

Prepare for next iteration

self._advance_iteration_variables()

def compute_costs_constrained(self, traj_distr, m, eta, augment=True):

""" Compute cost estimates used in the LQR backward pass. """

traj_info, traj_distr = self.cur[m].traj_info, traj_distr

if not augment:

maybe add a comment here for what this means

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108500672:

m: condition

pdpc: previous dynamics-previous controller cost

pdcc: previous dynamics-current controller cost

cdcc: current dynamics-current controller cost

pmc: previous actual cost

cmc: current actual cost

cur_res:

"""

def _stepadjust(self, m):

"""

Calculate new step sizes.

Args:

m: Condition

"""

pdpc = self.traj_opt.estimate_cost(

For readability, maybe add comment like "# previous dynamics, previous controller" here

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108500841:

self.cur[m].step_mult = self.prev[m].step_mult

elif self._hyperparams['step_rule'] == 'res_percent':

cur_res_sum = np.mean(np.sum(np.abs(cur_res), axis=1))

act_sum = np.mean(np.sum(np.abs(self.cur[m].cs), axis=1))

Print cost details.

LOGGER.debug(

'Cur residuals: %f. Residuals percentage: %f',

cur_res_sum, cur_res_sum/act_sum*100

)

if self.traj_opt.cons_per_step:

T = cur_res.shape[1]

new_mult = np.ones(T)

for t in range(T):

res = np.mean(np.sum(np.abs(cur_res[:, t:]), axis=1))

act = np.mean(np.sum(np.abs(self.cur[m].cs[:, t:]), axis=1))

if res/act > self._hyperparams['step_rule_res_ratio_dec']:

nitpic: there should be spaces around the "/"

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108501232:

K[t, :, :].T.dot(ipc[t, :, :]).dot(K[t, :, :]),

-K[t, :, :].T.dot(ipc[t, :, :])

]),

np.hstack([

-ipc[t, :, :].dot(K[t, :, :]), ipc[t, :, :]

])

])

fcv[t, :] += eta / (eta + multiplier) * np.hstack([

K[t, :, :].T.dot(ipc[t, :, :]).dot(k[t, :]),

-ipc[t, :, :].dot(k[t, :])

])

return fCm, fcv

def compute_costs(self, m, eta, augment=True):

return self._compute_costs(self.cur[m].traj_distr, m, eta, augment)

Why not just have a single function that sets traj_distr = self.cur[m].traj_distr at the top?

In python/gps/algorithm/cost/cost_fk_blocktouch.py https://github.com/cbfinn/gps/pull/80#discussion_r108501864:

@@ -0,0 +1,56 @@ +# TODO: This cost function is hard-coded and not easy to generalize.

Is this true? If so, can you put inline comments in parts where it is hard-coded? (in case someone wants to use it for a different scenario)

In python/gps/algorithm/cost/cost_fk_blocktouch.py https://github.com/cbfinn/gps/pull/80#discussion_r108501953:

@@ -0,0 +1,56 @@ +# TODO: This cost function is hard-coded and not easy to generalize. +import copy + +import numpy as np + +from gps.algorithm.cost.config import COST_FK +from gps.algorithm.cost.cost import Cost +from gps.algorithm.cost.cost_utils import get_ramp_multiplier +from gps.proto.gps_pb2 import JOINT_ANGLES, END_EFFECTOR_POINTS, \

END_EFFECTOR_POINT_JACOBIANS

+class CostFKBlock(Cost):

Can you add a brief description of this cost function? (like the binary cost description)

In python/gps/algorithm/policy_opt/tf_model_example.py https://github.com/cbfinn/gps/pull/80#discussion_r108502377:

@@ -90,6 +90,29 @@ def example_tf_network(dim_input=27, dim_output=7, batch_size=25, network_config return TfMap.init_from_lists([nn_input, action, precision], [mlp_applied], [loss_out]), fc_vars, []

+def tf_network(dim_input=27, dim_output=7, batch_size=25, network_config=None):

Can we just replace example_tf_network with this?

In python/gps/algorithm/traj_opt/config.py https://github.com/cbfinn/gps/pull/80#discussion_r108502519:

@@ -6,13 +6,28 @@

Dual variable updates for non-PD Q-function.

'del0': 1e-4, 'eta_error_threshold': 1e16,

'min_eta': 1e-4,

'min_eta': 1e-8,

How does this change affect other experiments?

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502751:

@@ -39,7 +41,8 @@ def init(self, hyperparams): self._covariance_damping = self._hyperparams['covariance_damping'] self._min_temperature = self._hyperparams['min_temperature']

def update(self, m, algorithm):

def update(self, m, algorithm, use_lqr_actions=False,

fixed_eta=None, use_fixed_eta=False, costs=None):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502840:

def update_pi2(self, samples, costs, mean_old, cov_old):

def update_pi2(self, samples, costs, mean_old, cov_old,

fixed_eta=None, use_fixed_eta=False):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502937:

cov_new[t])) / len(cov_old[t])

mult = np.power(mult, 1 / self._covariance_damping)

cov_new[t] = mult * cov_old[t]

Compute covariance inverse and cholesky decomposition.

inv_cov_new[t] = sp.linalg.inv(cov_new[t])

chol_cov_new[t] = sp.linalg.cholesky(cov_new[t])

return mean_new, cov_new, inv_cov_new, chol_cov_new

etas = np.zeros(T)

del_ = self._hyperparams['del0']

if self._hyperparams['pi2_cons_per_step']:

del = np.ones(T) * del

fail = True

maybe add comment for what fail is.

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503207:

-def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr): + +def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr, tot=True):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503308:

@@ -83,4 +87,84 @@ def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr): )
 # Add up divergences across time to get total divergence.
return np.sum(kl_div)

return np.sum(kl_div) if tot else kl_div

+def traj_distr_kl_alt(new_mu, new_sigma, new_traj_distr, prev_traj_distr, tot=True):

add description of what this is, and ideally also a docstring

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503364:

mu, sigma = new_mu[t, :dX], new_sigma[t, :dX, :dX]

kl_div[t] = max(

0,

0.5 * (logdet_prev - logdet_new - new_traj_distr.dU +

np.sum(np.diag(inv_prev.dot(sig_new))) +

k_diff.T.dot(inv_prev).dot(k_diff) +

mu.T.dot(K_diff.T).dot(inv_prev).dot(K_diff).dot(mu) +

np.sum(np.diag(K_diff.T.dot(inv_prev).dot(K_diff).dot(sigma))) +

2 * k_diff.T.dot(inv_prev).dot(K_diff).dot(mu))

)

return np.sum(kl_div) if tot else kl_div

+def approximated_cost(sample_list, traj_distr, traj_info):

add description and docstring

In python/gps/algorithm/policy_opt/tf_model_example.py https://github.com/cbfinn/gps/pull/80#discussion_r108504271:

@@ -90,6 +90,29 @@ def example_tf_network(dim_input=27, dim_output=7, batch_size=25, network_config return TfMap.init_from_lists([nn_input, action, precision], [mlp_applied], [loss_out]), fc_vars, []

+def tf_network(dim_input=27, dim_output=7, batch_size=25, network_config=None):

"""

An example of how one might want to specify a network in tensorflow.

Args:

dim_input: Dimensionality of input.

dim_output: Dimensionality of the output.

batch_size: Batch size.

update docstring

In python/gps/algorithm/traj_opt/traj_opt_lqr_python.py https://github.com/cbfinn/gps/pull/80#discussion_r108505181:

@@ -189,7 +284,10 @@ def backward(self, prev_traj_distr, traj_info, eta, algorithm, m): dU = prev_traj_distr.dU dX = prev_traj_distr.dX

traj_distr = prev_traj_distr.nans_like()

if self._update_in_bwd_pass:

This PR adds quite a lot of code to the backward function. Would it make sense to make it more modular?

In python/gps/algorithm/traj_opt/traj_opt_lqr_python.py https://github.com/cbfinn/gps/pull/80#discussion_r108505420:
                 if np.any(np.isnan(Fm)) or np.any(np.isnan(fv)):
raise ValueError('NaNs encountered in dynamics!') raise ValueError('Failed to find PD solution even for very \ large eta (check that dynamics and cost are \ reasonably well conditioned)!') return traj_distr, eta +

def _conv_check(self, con, kl_step):
maybe add brief description of what this is.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cbfinn/gps/pull/80#pullrequestreview-29534663, or mute the thread https://github.com/notifications/unsubscribe-auth/ACS5_1KktYEVa8BjiEV0tHe4sGUgdiBjks5rqVX3gaJpZM4MjVyF .

cbfinn commented 7 years ago

I'll take another quick pass.

On Tue, Mar 28, 2017 at 2:44 PM, Marvin Zhang notifications@github.com wrote:

Thanks Chelsea! I'll get to these comments tonight or tomorrow and revise accordingly. Is it good to merge afterwards, or would you (or someone else) want another pass?

Marvin

On Mar 28, 2017 2:48 PM, "Chelsea Finn" notifications@github.com wrote:

@cbfinn requested changes on this pull request.

Here are comments, mostly minor. The code is very clean - thank you!

I think it would be good to have some sort of README that is associated with this method, and explains how to reproduce your simulated experiments in the paper.

(e.g. see the readme in the SSRL branch.)

In python/gps/algorithm/algorithm_mdgps_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108499710:

for m in range(self.M):

AlgorithmTrajOptPILQR._stepadjust(self, m)

self._update_trajectories()

S-step

self._update_policy()

Prepare for next iteration

self._advance_iteration_variables()

def compute_costs_constrained(self, traj_distr, m, eta, augment=True):

""" Compute cost estimates used in the LQR backward pass. """

traj_info, traj_distr = self.cur[m].traj_info, traj_distr

if not augment:

maybe add a comment here for what this means

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108500672:

m: condition

pdpc: previous dynamics-previous controller cost

pdcc: previous dynamics-current controller cost

cdcc: current dynamics-current controller cost

pmc: previous actual cost

cmc: current actual cost

cur_res:

"""

def _stepadjust(self, m):

"""

Calculate new step sizes.

Args:

m: Condition

"""

pdpc = self.traj_opt.estimate_cost(

For readability, maybe add comment like "# previous dynamics, previous controller" here

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108500841:

self.cur[m].step_mult = self.prev[m].step_mult

elif self._hyperparams['step_rule'] == 'res_percent':

cur_res_sum = np.mean(np.sum(np.abs(cur_res), axis=1))

act_sum = np.mean(np.sum(np.abs(self.cur[m].cs), axis=1))

Print cost details.

LOGGER.debug(

'Cur residuals: %f. Residuals percentage: %f',

cur_res_sum, cur_res_sum/act_sum*100

)

if self.traj_opt.cons_per_step:

T = cur_res.shape[1]

new_mult = np.ones(T)

for t in range(T):

res = np.mean(np.sum(np.abs(cur_res[:, t:]), axis=1))

act = np.mean(np.sum(np.abs(self.cur[m].cs[:, t:]), axis=1))

if res/act > self._hyperparams['step_rule_res_ratio_dec']:

nitpic: there should be spaces around the "/"

In python/gps/algorithm/algorithm_traj_opt_pilqr.py https://github.com/cbfinn/gps/pull/80#discussion_r108501232:

K[t, :, :].T.dot(ipc[t, :, :]).dot(K[t, :, :]),

-K[t, :, :].T.dot(ipc[t, :, :])

]),

np.hstack([

-ipc[t, :, :].dot(K[t, :, :]), ipc[t, :, :]

])

])

fcv[t, :] += eta / (eta + multiplier) * np.hstack([

K[t, :, :].T.dot(ipc[t, :, :]).dot(k[t, :]),

-ipc[t, :, :].dot(k[t, :])

])

return fCm, fcv

def compute_costs(self, m, eta, augment=True):

return self._compute_costs(self.cur[m].traj_distr, m, eta, augment)

Why not just have a single function that sets traj_distr = self.cur[m].traj_distr at the top?

In python/gps/algorithm/cost/cost_fk_blocktouch.py https://github.com/cbfinn/gps/pull/80#discussion_r108501864:

@@ -0,0 +1,56 @@ +# TODO: This cost function is hard-coded and not easy to generalize.

Is this true? If so, can you put inline comments in parts where it is hard-coded? (in case someone wants to use it for a different scenario)

In python/gps/algorithm/cost/cost_fk_blocktouch.py https://github.com/cbfinn/gps/pull/80#discussion_r108501953:

@@ -0,0 +1,56 @@ +# TODO: This cost function is hard-coded and not easy to generalize. +import copy + +import numpy as np + +from gps.algorithm.cost.config import COST_FK +from gps.algorithm.cost.cost import Cost +from gps.algorithm.cost.cost_utils import get_ramp_multiplier +from gps.proto.gps_pb2 import JOINT_ANGLES, END_EFFECTOR_POINTS, \

END_EFFECTOR_POINT_JACOBIANS

+class CostFKBlock(Cost):

Can you add a brief description of this cost function? (like the binary cost description)

In python/gps/algorithm/policy_opt/tf_model_example.py https://github.com/cbfinn/gps/pull/80#discussion_r108502377:

@@ -90,6 +90,29 @@ def example_tf_network(dim_input=27, dim_output=7, batch_size=25, network_config return TfMap.init_from_lists([nn_input, action, precision], [mlp_applied], [loss_out]), fc_vars, []

+def tf_network(dim_input=27, dim_output=7, batch_size=25, network_config=None):

Can we just replace example_tf_network with this?

In python/gps/algorithm/traj_opt/config.py https://github.com/cbfinn/gps/pull/80#discussion_r108502519:

@@ -6,13 +6,28 @@

Dual variable updates for non-PD Q-function.

'del0': 1e-4, 'eta_error_threshold': 1e16,

'min_eta': 1e-4,

'min_eta': 1e-8,

How does this change affect other experiments?

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502751:

@@ -39,7 +41,8 @@ def init(self, hyperparams): self._covariance_damping = self._hyperparams['covariance_damping'] self._min_temperature = self._hyperparams['min_temperature']

def update(self, m, algorithm):

def update(self, m, algorithm, use_lqr_actions=False,

fixed_eta=None, use_fixed_eta=False, costs=None):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502840:

def update_pi2(self, samples, costs, mean_old, cov_old):

def update_pi2(self, samples, costs, mean_old, cov_old,

fixed_eta=None, use_fixed_eta=False):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_pi2.py https://github.com/cbfinn/gps/pull/80#discussion_r108502937:

cov_new[t])) / len(cov_old[t])

mult = np.power(mult, 1 / self._covariance_damping)

cov_new[t] = mult * cov_old[t]

Compute covariance inverse and cholesky decomposition.

inv_cov_new[t] = sp.linalg.inv(cov_new[t])

chol_cov_new[t] = sp.linalg.cholesky(cov_new[t])

return mean_new, cov_new, inv_cov_new, chol_cov_new

etas = np.zeros(T)

del_ = self._hyperparams['del0']

if self._hyperparams['pi2_cons_per_step']:

del = np.ones(T) * del

fail = True

maybe add comment for what fail is.

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503207:

-def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr): + +def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr, tot=True):

update docstring

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503308:

@@ -83,4 +87,84 @@ def traj_distr_kl(new_mu, new_sigma, new_traj_distr, prev_traj_distr): )

Add up divergences across time to get total divergence.

return np.sum(kl_div)

return np.sum(kl_div) if tot else kl_div

+def traj_distr_kl_alt(new_mu, new_sigma, new_traj_distr, prev_traj_distr, tot=True):

add description of what this is, and ideally also a docstring

In python/gps/algorithm/traj_opt/traj_opt_utils.py https://github.com/cbfinn/gps/pull/80#discussion_r108503364:

mu, sigma = new_mu[t, :dX], new_sigma[t, :dX, :dX]

kl_div[t] = max(

0,

0.5 * (logdet_prev - logdet_new - new_traj_distr.dU +

np.sum(np.diag(inv_prev.dot(sig_new))) +

k_diff.T.dot(inv_prev).dot(k_diff) +

mu.T.dot(K_diff.T).dot(inv_prev).dot(K_diff).dot(mu) +

np.sum(np.diag(K_diff.T.dot(inv_prev).dot(K_diff).dot(sigma))) +

2 * k_diff.T.dot(inv_prev).dot(K_diff).dot(mu))

)

return np.sum(kl_div) if tot else kl_div

+def approximated_cost(sample_list, traj_distr, traj_info):

add description and docstring

In python/gps/algorithm/policy_opt/tf_model_example.py https://github.com/cbfinn/gps/pull/80#discussion_r108504271:

@@ -90,6 +90,29 @@ def example_tf_network(dim_input=27, dim_output=7, batch_size=25, network_config return TfMap.init_from_lists([nn_input, action, precision], [mlp_applied], [loss_out]), fc_vars, []

+def tf_network(dim_input=27, dim_output=7, batch_size=25, network_config=None):

"""

An example of how one might want to specify a network in tensorflow.

Args:

dim_input: Dimensionality of input.

dim_output: Dimensionality of the output.

batch_size: Batch size.

update docstring

In python/gps/algorithm/traj_opt/traj_opt_lqr_python.py https://github.com/cbfinn/gps/pull/80#discussion_r108505181:

@@ -189,7 +284,10 @@ def backward(self, prev_traj_distr, traj_info, eta, algorithm, m): dU = prev_traj_distr.dU dX = prev_traj_distr.dX

traj_distr = prev_traj_distr.nans_like()

if self._update_in_bwd_pass:

This PR adds quite a lot of code to the backward function. Would it make sense to make it more modular?

In python/gps/algorithm/traj_opt/traj_opt_lqr_python.py https://github.com/cbfinn/gps/pull/80#discussion_r108505420:

if np.any(np.isnan(Fm)) or np.any(np.isnan(fv)): raise ValueError('NaNs encountered in dynamics!') raise ValueError('Failed to find PD solution even for very \ large eta (check that dynamics and cost are \ reasonably well conditioned)!') return traj_distr, eta +

def _conv_check(self, con, kl_step):

maybe add brief description of what this is.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cbfinn/gps/pull/80#pullrequestreview-29534663, or mute the thread https://github.com/notifications/unsubscribe-auth/ACS5_ 1KktYEVa8BjiEV0tHe4sGUgdiBjks5rqVX3gaJpZM4MjVyF .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/cbfinn/gps/pull/80#issuecomment-289914196, or mute the thread https://github.com/notifications/unsubscribe-auth/ABMlARwXSIOFP2PIBHHRw4iuAvqLLRdaks5rqX8igaJpZM4MjVyF .

zhangmarvin commented 7 years ago

Ready for you to take another look, @cbfinn. Regarding the README: I feel that's a bit unnecessary, since reproducing the results in the paper is as simple as running the mjc_gripper_pusher and mjc_door_opening experiments, i.e., unlike SSRL, there's nothing complicated or unusual compared to normal GPS.

zhangmarvin commented 7 years ago

I've added 'pilqr' to the experiment names and done the remaining cleanup.

cbfinn commented 7 years ago

LGTM!

AswinRetnakumar commented 4 years ago

Hey is there a way to integrate pilqr algorithm with pr2 robot for moving the robot hand the a point by changing the hyperparameter values.

cbfinn / gps

PILQR #80

(e.g. see the readme in the SSRL branch.)

S-step

Prepare for next iteration

maybe add a comment here for what this means

For readability, maybe add comment like "# previous dynamics, previous controller" here

Print cost details.

nitpic: there should be spaces around the "/"

Why not just have a single function that sets traj_distr = self.cur[m].traj_distr at the top?

Is this true? If so, can you put inline comments in parts where it is hard-coded? (in case someone wants to use it for a different scenario)

Can you add a brief description of this cost function? (like the binary cost description)

Can we just replace example_tf_network with this?

Dual variable updates for non-PD Q-function.

How does this change affect other experiments?

update docstring

update docstring

Compute covariance inverse and cholesky decomposition.

maybe add comment for what fail is.

update docstring

add description of what this is, and ideally also a docstring

add description and docstring

update docstring

This PR adds quite a lot of code to the backward function. Would it make sense to make it more modular?

(e.g. see the readme in the SSRL branch.)

S-step

Prepare for next iteration

maybe add a comment here for what this means

For readability, maybe add comment like "# previous dynamics, previous controller" here

Print cost details.

nitpic: there should be spaces around the "/"

Why not just have a single function that sets traj_distr = self.cur[m].traj_distr at the top?

Is this true? If so, can you put inline comments in parts where it is hard-coded? (in case someone wants to use it for a different scenario)

Can you add a brief description of this cost function? (like the binary cost description)

Can we just replace example_tf_network with this?

Dual variable updates for non-PD Q-function.

How does this change affect other experiments?

update docstring

update docstring

Compute covariance inverse and cholesky decomposition.

maybe add comment for what fail is.

update docstring

Add up divergences across time to get total divergence.

add description of what this is, and ideally also a docstring

add description and docstring

update docstring

This PR adds quite a lot of code to the backward function. Would it make sense to make it more modular?