random-user-x closed this issue 6 years ago
@Kaixhin, I think the trust region update is approximated incorrectly. The present implementation might not be the one discussed in the paper. Would you like me to open a PR with the correct implementation?
Yes, for certain. The update should be done between the softmax and the input to the softmax, rather than on the parameters of the policy head, so it's not following the paper at the moment. If you've got the correct implementation, then please open a PR.
@Kaixhin I think it is better to detach z_star_p. Please let me know how you feel about this.
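For anyone following along, here is a minimal sketch of the closed-form trust region step as I understand it from the paper: z* = g - max(0, (k·g - delta) / ||k||^2) k, where g is the gradient of the objective with respect to the distribution statistics and k is the gradient of the KL divergence from the average policy with respect to the same statistics. This is plain Python on lists, not the repository's code; the function name `trust_region_direction` and the toy numbers are made up for illustration.

```python
def trust_region_direction(g, k, delta):
    """Sketch of z* = g - max(0, (k.g - delta) / ||k||^2) * k.

    g: gradient of the objective w.r.t. the distribution statistics
    k: gradient of the KL divergence w.r.t. the same statistics
    delta: trust region threshold
    All names here are hypothetical, chosen for this example only.
    """
    k_dot_g = sum(ki * gi for ki, gi in zip(k, g))
    k_norm_sq = sum(ki * ki for ki in k)
    if k_norm_sq == 0.0:
        # No KL gradient, so the constraint cannot be violated.
        return list(g)
    # Only shrink g along k when the linearised KL constraint is violated.
    scale = max(0.0, (k_dot_g - delta) / k_norm_sq)
    return [gi - scale * ki for gi, ki in zip(g, k)]


# Toy example: the constraint is active, so g is pulled back along k.
z_star = trust_region_direction([1.0, 2.0], [1.0, 0.0], delta=0.5)
print(z_star)
```

In an autograd framework, the resulting vector would then be backpropagated from the distribution statistics into the network parameters, presumably as a constant, which seems to be the motivation for detaching it.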