Alfredvc / paac

Open source implementation of the PAAC algorithm presented in Efficient Parallel Methods for Deep Reinforcement Learning
https://arxiv.org/abs/1705.04862

Low Seaquest avg score compared to A3C #3

Open beniz opened 6 years ago

beniz commented 6 years ago

Looking at a handful of A3C implementations and their results on Seaquest, they appear to score around 50K:

PAAC, however, reaches a plateau around 2K according to our tests (similar to your paper). Visual inspection of the policy shows that the submarine does not resurface. While this is a common difficulty of the game, A3C appears able to overcome it (maybe this could be due to a modification in OpenAI Gym, since their Atari setup has some differences with ALE).

We've looked at various exploration strategies (ε-greedy, Boltzmann, Bayesian dropout), with no improvement at the moment.

Do you see any particular reason PAAC would underperform in this case? An LSTM might help, but from the two OpenAI Gym pointers above, it seems it should not be critical for Seaquest.

Alfredvc commented 6 years ago

Hi,

Seaquest was part of the "test set", meaning that only the final algorithm, with the final set of hyperparameters, was tested on that game, so I know little about the specifics of the learning process there. However, I may be able to give you some avenues of experimentation.

I have heard from other researchers that adding a "delay" between starting the different threads in A3C helps with learning. An analog to that for PAAC would be to execute, only at the very beginning of training, a random number of random actions in each environment before starting to learn. This would lead to the different environments being in different stages of the game at the beginning of training.
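A minimal sketch of that warm-up idea, assuming a gym-style `reset`/`step` interface and a list of parallel environments (the names here are illustrative, not the repo's actual classes):

```python
import numpy as np

def warm_up_environments(environments, max_random_steps=500, seed=0):
    """Advance each environment by a random number of random actions so the
    parallel environments start training desynchronized."""
    rng = np.random.RandomState(seed)
    for env in environments:
        env.reset()
        n_steps = rng.randint(1, max_random_steps + 1)
        for _ in range(n_steps):
            action = env.action_space.sample()
            _, _, done, _ = env.step(action)
            if done:
                env.reset()
```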

Since you are experimenting with different exploration techniques, you could also try increasing the policy entropy constant in the loss, or even starting with a high constant and annealing it over time. This constant regulates how "preferable" a uniform policy is relative to higher return: no entropy loss leads to very fast convergence to a near-deterministic policy, while a high constant leads to a very uniform policy.
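For illustration, here is a hedged sketch of what an annealed entropy bonus could look like; the coefficient values and function names are assumptions, not the values used in the paper or repo:

```python
import numpy as np

def annealed_beta(step, total_steps, beta_start=0.05, beta_end=0.01):
    """Linearly anneal the entropy coefficient from beta_start to beta_end."""
    frac = min(step / float(total_steps), 1.0)
    return beta_start + frac * (beta_end - beta_start)

def entropy_bonus(policy_probs, beta, eps=1e-8):
    """Mean policy entropy scaled by beta, to be subtracted from the policy loss."""
    entropy = -np.sum(policy_probs * np.log(policy_probs + eps), axis=-1)
    return beta * np.mean(entropy)

# Usage sketch: total_loss = policy_loss - entropy_bonus(probs, annealed_beta(t, T)) + value_loss
```

Subtracting this term rewards the policy for staying spread out early in training, and the linear decay lets it sharpen later on.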