Closed miriaford closed 4 years ago
You still backprop from the critic, so the encoder gets gradients from the Q-value estimation. The only detached part is the gradient from the actor, which results in more stable policies.
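A minimal PyTorch sketch of that gradient flow (hypothetical toy modules standing in for the real conv encoder and SAC heads in `curl_sac.py`): the critic loss backprops into the encoder, while the actor loss sees a detached encoding and so leaves the encoder untouched.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pixel encoder, policy head, and Q head.
encoder = nn.Linear(8, 4)
actor_head = nn.Linear(4, 2)
critic_head = nn.Linear(4 + 2, 1)

obs = torch.randn(16, 8)
act = torch.randn(16, 2)

# Critic update: gradients DO flow into the encoder.
q = critic_head(torch.cat([encoder(obs), act], dim=-1))
critic_loss = q.pow(2).mean()
critic_loss.backward()
assert encoder.weight.grad is not None  # encoder is trained by the critic

encoder.zero_grad()

# Actor update: the encoding is detached, so the policy gradient
# never reaches the encoder (the behavior discussed in this thread).
pi = actor_head(encoder(obs).detach())
actor_loss = pi.pow(2).mean()
actor_loss.backward()
# encoder grad is None or all zeros, depending on zero_grad's set_to_none
assert encoder.weight.grad is None or encoder.weight.grad.abs().sum() == 0
```

The `.detach()` call is the whole trick: the actor still conditions on the learned features, but only the critic's Q-value loss shapes them.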
AFAIK it's purely empirical.
On Tue, May 5, 2020 at 2:48 PM Miria Ford notifications@github.com wrote:
Thanks! Is there any literature to back this up? Or is it purely empirical?
I might have missed something simple, but could you kindly explain why you don't update the encoder part here?
https://github.com/MishaLaskin/rad/blob/master/curl_sac.py#L411-L413
In other SAC implementations (e.g. rlkit), the gradient backpropagates through the entire policy network, including the encoder. Thanks!