Closed: Feryal closed this 7 years ago
Can I see a comparison of results before and after, please? If there's a performance regression I'll still merge this, but it'd be good to document. Thanks!
According to the A3C paper:
We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies.
The logic seems to be that entropy maximisation helps prevent premature convergence to a single deterministic policy, especially when several actions all lead to equally good returns.
Also, your update, dθ ← dθ - β∙∇θH(π(s_i; θ)), suggests the same: with a gradient-descent update on the loss, subtracting β∙∇θH pushes θ in the direction that increases entropy.
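For concreteness, here's a minimal sketch of the idea (not the code in this PR; the `beta=0.01` value and function names are just illustrative). The entropy term is subtracted from the loss, so minimising the loss maximises entropy and penalises a peaked, near-deterministic policy:

```python
import numpy as np

def entropy(probs):
    """H(pi(s)) = -sum_a pi(a|s) * log pi(a|s)."""
    return -np.sum(probs * np.log(probs))

def actor_loss(log_prob_action, advantage, probs, beta=0.01):
    # Minimising this loss does gradient ascent on the policy
    # objective plus beta * H, so the entropy bonus discourages
    # premature collapse onto a single action.
    return -log_prob_action * advantage - beta * entropy(probs)

# A uniform policy has maximal entropy; a peaked one has low entropy.
uniform = np.full(4, 0.25)
peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(entropy(uniform))  # ln(4) ~ 1.386
print(entropy(peaked))   # ~ 0.168
```

So when several actions are equally good, the entropy bonus keeps the policy closer to uniform instead of arbitrarily committing to one of them.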
…late exploration