Shapeno opened 4 months ago
In
obj_alpha = (self.alpha_log * (self.target_entropy - log_prob).detach()).mean()
when alpha_log = 0, alpha will be 1 forever. The correct way is:
obj_alpha = (self.alpha * (self.target_entropy - log_prob).detach()).mean()
This problem is also found in rlkit.
Algorithm details are in the source code of softlearning: https://github.com/rail-berkeley/softlearning/blob/13cf187cc93d90f7c217ea2845067491c3c65464/softlearning/algorithms/sac.py#L256
The affected line in ElegantRL: https://github.com/AI4Finance-Foundation/ElegantRL/blob/b4b9d662b9f9cb7cc368ac2b1036b5119eb20be4/elegantrl/agents/AgentSAC.py#L48C13-L48C23
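For illustration, here is a minimal PyTorch sketch contrasting the two temperature objectives. The names mirror the snippets quoted above, but the tensor values are made up and this is not ElegantRL's actual training loop; the key difference is that differentiating through alpha = exp(alpha_log) scales the gradient by alpha itself, whereas the alpha_log variant keeps a constant step in log-space:

```python
import math
import torch

# Hypothetical setup: alpha_log is the learnable log-temperature.
alpha_log = torch.tensor([math.log(0.5)], requires_grad=True)
target_entropy = -1.0
log_prob = torch.tensor([-2.0, -0.5, -1.5])  # made-up policy log-probs

# Entropy gap, detached as in both snippets; its mean here is 1/3.
coeff = (target_entropy - log_prob).detach()

# Variant reported in the issue: optimize alpha_log directly.
obj_alpha_log = (alpha_log * coeff).mean()

# Proposed fix: differentiate through alpha = exp(alpha_log),
# as in softlearning's sac.py.
obj_alpha = (alpha_log.exp() * coeff).mean()

g_log, = torch.autograd.grad(obj_alpha_log, alpha_log)
g_alpha, = torch.autograd.grad(obj_alpha, alpha_log)

# With alpha = 0.5, the fixed objective's gradient is half as large:
# the update shrinks as alpha -> 0 instead of staying constant in
# log-space.
print(g_log.item(), g_alpha.item())  # ~0.3333, ~0.1667
```

Note that at alpha_log = 0 (alpha = 1) the two gradients coincide, so the difference only shows up once alpha moves away from 1.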