Open · danijar opened this issue 8 years ago
Oh interesting! I will definitely take a look at this. Thank you.
I noticed this as well and believe it's a significant cause of performance degradation. Additionally, you don't seem to be adding the entropy term to the objective, which the paper mentions as being useful for improving exploration.
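For reference, here is a minimal sketch of what the entropy bonus could look like in TensorFlow 1.x. The tensor names (`action_probs`, `policy_loss`) and the placeholder shapes are just stand-ins for the corresponding tensors in a3c.py, and `entropy_beta` is an assumed coefficient (0.01 is a commonly used value):

```python
import tensorflow as tf

# Hypothetical stand-ins for the repo's actual tensors.
action_probs = tf.placeholder(tf.float32, [None, 4])  # softmax policy output
policy_loss = tf.placeholder(tf.float32, [])           # existing policy-gradient loss

entropy_beta = 0.01  # assumed weight; small values are typical

# Entropy of the policy distribution per state (epsilon avoids log(0)).
entropy = -tf.reduce_sum(action_probs * tf.log(action_probs + 1e-10), axis=1)

# Maximizing entropy means subtracting it from the loss being minimized,
# which discourages the policy from collapsing onto a single action.
total_loss = policy_loss - entropy_beta * tf.reduce_mean(entropy)
```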
I think you should use `tf.stop_gradient()` in https://github.com/coreylynch/async-rl/blob/master/a3c.py#L164. Otherwise, after some training the policy tends to use one action exclusively. Took me a while to figure this out in my own code, too.
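A rough sketch of the fix, with hypothetical placeholders (`log_prob`, `R`, `value`) standing in for the tensors around that line, not the repo's exact variable names:

```python
import tensorflow as tf

# Hypothetical stand-ins: log pi(a_t|s_t) for the taken action, the n-step
# return placeholder, and the critic's value estimate V(s_t).
log_prob = tf.placeholder(tf.float32, [None])
R = tf.placeholder(tf.float32, [None])
value = tf.placeholder(tf.float32, [None])

# Treat the advantage as a constant for the policy gradient, so the actor
# loss does not push gradients back into the value estimate.
advantage = R - tf.stop_gradient(value)
policy_loss = -tf.reduce_sum(log_prob * advantage)

# The critic is still trained through the usual squared-error term.
value_loss = tf.reduce_sum(tf.square(R - value))
```

Without the `stop_gradient`, the policy term also trains the value head to make the advantage large, which is one way the policy ends up collapsing onto a single action.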