Kaixhin / ACER

Actor-critic with experience replay
MIT License

entropy cost sign changed as the entropy should be maximised to stimu… #2

Closed Feryal closed 7 years ago

Feryal commented 7 years ago

…late exploration

Kaixhin commented 7 years ago

Can I see a comparison of results before and after, please? Even if there's a performance regression I'll still merge this, but it would be good to document. Thanks!

Feryal commented 7 years ago

According to the A3C paper:

We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies.

The logic seems to be that maximising entropy helps prevent the policy from collapsing onto a single choice, especially when several choices all lead to equally good returns.

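Also, your math, dθ ← dθ - β∙∇θH(π(s_i; θ)), suggests the same: if dθ is then applied as a gradient descent step, subtracting the entropy gradient pushes θ in the direction that increases H.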
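
To make the sign argument concrete, here is a minimal sketch in plain Python, not the repository's code. It assumes a softmax policy over three actions, a descent update θ ← θ - lr∙dθ, and made-up values for β, the learning rate and the logits; the helper names are hypothetical. It compares the entropy after one step with the minus sign from the formula above against the same step with a plus sign.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    # H(pi) = -sum_a pi(a) * log pi(a) for the softmax policy.
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def entropy_grad(logits, eps=1e-6):
    # Central finite differences, to keep the sketch dependency-free.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grad.append((entropy(up) - entropy(down)) / (2 * eps))
    return grad

def descent_step(logits, sign, beta=0.1, lr=1.0):
    # d_theta <- d_theta + sign * beta * grad H, starting from d_theta = 0
    # (the policy and value terms are left out on purpose).
    d_theta = [sign * beta * g for g in entropy_grad(logits)]
    # Assumed gradient descent update: theta <- theta - lr * d_theta.
    return [x - lr * d for x, d in zip(logits, d_theta)]

logits = [2.0, 0.5, -1.0]  # hypothetical, deliberately peaked policy
before = entropy(logits)
after_minus = entropy(descent_step(logits, sign=-1.0))  # sign used in the formula
after_plus = entropy(descent_step(logits, sign=+1.0))   # opposite sign

print("entropy before:      ", before)
print("after minus-sign step:", after_minus)
print("after plus-sign step: ", after_plus)
```

Under these assumptions the minus-sign step should raise the entropy and the plus-sign step should lower it. If the optimiser performed gradient ascent on dθ instead, the roles would swap, so the sign only makes sense together with the update rule.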