wangsd01 opened this issue 6 years ago:
This part adds the divergence between the predicted trajectory and the sampled trajectory as an additional cost, i.e. (Kx + k - u).T inverse_policy_covariance (Kx + k - u), where u is the sampled action from the data and Kx + k is the predicted action from the global policy network.
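The quadratic penalty described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's actual code; all dimensions and variable names (`K`, `k`, `x`, `u`, `inv_cov`) are hypothetical stand-ins for the quantities in the comment:

```python
import numpy as np

# Hypothetical dimensions: state dim 3, action dim 2.
rng = np.random.default_rng(0)
dX, dU = 3, 2
K = rng.standard_normal((dU, dX))   # feedback gain of the linear policy
k = rng.standard_normal(dU)         # feedforward term
x = rng.standard_normal(dX)         # a sampled state
u = rng.standard_normal(dU)         # sampled action from data
A = rng.standard_normal((dU, dU))
inv_cov = A @ A.T + np.eye(dU)      # stand-in for the inverse policy covariance (SPD)

diff = K @ x + k - u                # predicted action minus sampled action
cost = diff @ inv_cov @ diff        # the quadratic divergence penalty
```

Since `inv_cov` is symmetric positive definite, `cost` is always non-negative, which is what makes it usable as an additional cost term.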
@wangsd01 Hi, I am also looking at these lines. Have you solved the problem?
I am not 100% sure what's happening, but one thing that looks especially suspicious to me is that the derivative with respect to u is Cov^{-1}.dot(k_old).
Looking at the forward pass in the code repo, it uses u = Kx + k rather than u = K(x - x_old) + k + u_old. So if we actually take the derivative of the KL penalty with respect to u, I'd expect something like Cov^{-1}.dot(u_new - u_old) = Cov^{-1}.dot(K_new.dot(x) - K_old.dot(x) + k_new - k_old), which is not Cov^{-1}.dot(k_old).
Not sure if I missed anything. Would be great if you could help :( @cbfinn
Could you give a reference for this part of the code? Thank you!