cbfinn / gps

Guided Policy Search
http://rll.berkeley.edu/gps/

How "trajectory divergence term" is calculated in compute_cost function in algorithm_traj_opt.py #98

Open wangsd01 opened 6 years ago

wangsd01 commented 6 years ago

Could you give a reference for this part of the code? Thank you!

def compute_costs(self, m, eta, augment=True):
    """ Compute cost estimates used in the LQR backward pass. """
    traj_info, traj_distr = self.cur[m].traj_info, self.cur[m].traj_distr
    if not augment:  # Whether to augment the cost with a term to penalize KL.
        return traj_info.Cm, traj_info.cv

    multiplier = self._hyperparams['max_ent_traj']
    # Scale the true cost expansion down by (eta + multiplier).
    fCm, fcv = traj_info.Cm / (eta + multiplier), traj_info.cv / (eta + multiplier)
    K, ipc, k = traj_distr.K, traj_distr.inv_pol_covar, traj_distr.k

    # Add in the trajectory divergence term: the quadratic expansion of the
    # KL divergence from the previous linear-Gaussian controller
    # u ~ N(K x + k, pol_covar), weighted by eta / (eta + multiplier).
    for t in range(self.T - 1, -1, -1):
        # Quadratic (Hessian) block of 0.5 (K x + k - u)' ipc (K x + k - u),
        # written over the joint vector z = [x; u].
        fCm[t, :, :] += eta / (eta + multiplier) * np.vstack([
            np.hstack([
                K[t, :, :].T.dot(ipc[t, :, :]).dot(K[t, :, :]),
                -K[t, :, :].T.dot(ipc[t, :, :])
            ]),
            np.hstack([
                -ipc[t, :, :].dot(K[t, :, :]), ipc[t, :, :]
            ])
        ])
        # Linear term of the same expansion; the constant 0.5 k' ipc k is
        # dropped, since constants do not affect the LQR solution.
        fcv[t, :] += eta / (eta + multiplier) * np.hstack([
            K[t, :, :].T.dot(ipc[t, :, :]).dot(k[t, :]),
            -ipc[t, :, :].dot(k[t, :])
        ])

    return fCm, fcv
wangsd01 commented 6 years ago

This part adds the divergence between the new trajectory distribution and the previous one as an additional cost, i.e. the quadratic form 0.5 * (K x + k - u)' inv_pol_covar (K x + k - u), where u is the action and K x + k is the mean action under the previous linear-Gaussian controller traj_distr. Expanding this quadratic over z = [x; u] gives exactly the block matrix added to fCm and the linear term added to fcv.
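
To sanity-check this reading, here is a minimal NumPy sketch (dX, dU, and all the arrays are toy values I made up, not from the repo). It builds the same block matrix and linear term as the loop above and confirms they match the quadratic penalty up to a dropped constant:

import numpy as np

# Toy sizes; everything below is hypothetical, for the check only.
dX, dU = 3, 2
rng = np.random.default_rng(0)

K = rng.standard_normal((dU, dX))        # stands in for traj_distr.K[t]
k = rng.standard_normal(dU)              # stands in for traj_distr.k[t]
A = rng.standard_normal((dU, dU))
P = A.dot(A.T) + dU * np.eye(dU)         # SPD stand-in for inv_pol_covar[t]

# Block matrix and linear term, exactly as built in compute_costs.
Cm = np.vstack([
    np.hstack([K.T.dot(P).dot(K), -K.T.dot(P)]),
    np.hstack([-P.dot(K), P]),
])
cv = np.hstack([K.T.dot(P).dot(k), -P.dot(k)])

# Evaluate the quadratic cost 0.5 z'Cm z + cv'z at a random (x, u).
x = rng.standard_normal(dX)
u = rng.standard_normal(dU)
z = np.hstack([x, u])
block_form = 0.5 * z.dot(Cm).dot(z) + cv.dot(z)

# Direct evaluation of 0.5 (K x + k - u)' P (K x + k - u).
d = K.dot(x) + k - u
direct = 0.5 * d.dot(P).dot(d)

# They agree up to the constant 0.5 k'P k, which the code never adds.
assert np.allclose(block_form + 0.5 * k.dot(P).dot(k), direct)

The leftover 0.5 * k' P k constant is exactly what the code drops, which is fine because constant offsets do not change the LQR backward pass.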

WilsonWangTHU commented 5 years ago

@wangsd01 Hi, I am also looking at these lines. Have you solved the problem?

I am not 100% sure what's happening, but one thing that looks especially suspicious to me is that the derivative with respect to u is Cov^{-1}.dot(k_old).

Looking at the forward pass in the repo, it uses u = K x + k, rather than u = K (x - x_old) + k + u_old. So I feel that if we actually take the derivative of the KL penalty with respect to u, we should get something like Cov^{-1}.dot(u_new - u_old) = Cov^{-1}.dot(K_new.dot(x) - K_old.dot(x) + k_new - k_old) != Cov^{-1}.dot(k_old) (a small numerical sketch of what I mean is below).
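
Here is the sketch (toy shapes, nothing from the repo). Taking the gradient of the full augmented quadratic 0.5 z' Cm z + cv' z with respect to u, the Cm block contributes the state-dependent part, so cv alone is not the whole derivative:

import numpy as np

dX, dU = 3, 2
rng = np.random.default_rng(1)

K = rng.standard_normal((dU, dX))        # previous controller gains
k = rng.standard_normal(dU)              # previous controller offset
A = rng.standard_normal((dU, dU))
P = A.dot(A.T) + dU * np.eye(dU)         # SPD stand-in for inv_pol_covar[t]

# Same block matrix and linear term as in compute_costs.
Cm = np.vstack([
    np.hstack([K.T.dot(P).dot(K), -K.T.dot(P)]),
    np.hstack([-P.dot(K), P]),
])
cv = np.hstack([K.T.dot(P).dot(k), -P.dot(k)])

x = rng.standard_normal(dX)
u = rng.standard_normal(dU)
z = np.hstack([x, u])

# Gradient of 0.5 z'Cm z + cv'z with respect to the action block u.
grad_u = Cm[dX:, :].dot(z) + cv[dX:]

# It comes out as Cov^{-1}.dot(u - (K x + k)); the -P.dot(k) in cv is
# only the linear piece of that expression.
assert np.allclose(grad_u, P.dot(u - (K.dot(x) + k)))

So maybe the K- and x-dependent terms I was expecting are hiding in the Cm rows rather than in cv; I would still appreciate a confirmation.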

Not sure if I missed anything. It would be great if you could help :( @cbfinn