NVlabs / GA3C

Hybrid CPU/GPU implementation of the A3C algorithm for deep reinforcement learning.
BSD 3-Clause "New" or "Revised" License

Cannot learn problems with a single, terminal reward #31

Open wagnew3 opened 7 years ago

wagnew3 commented 7 years ago

Thank you for the easy-to-use and fast A3C implementation. I created a simple problem for rapid testing that rewards 0 on every step except the terminal step, where it rewards either -1 or 1. GA3C cannot learn this problem because of line 107 in ProcessAgent.py:

terminal_reward = 0 if done else value

which causes the agent to ignore the only meaningful reward in this environment, and line 63 in ProcessAgent.py:

return experiences[:-1]

which causes the agent to ignore the only meaningful experience in this environment.

This is easily fixed by changing line 107 in ProcessAgent.py to

terminal_reward = reward if done else value

and _accumulate_rewards() in ProcessAgent.py to return all experiences if the agent has taken a terminal step. These changes should generally increase performance as terminal steps often contain valuable reward signal.
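For concreteness, here is a minimal sketch of the proposed accumulation written against a generic experience list; the function, field, and variable names are illustrative, not GA3C's exact code:

def accumulate_rewards(experiences, discount, last_value, done):
    # Illustrative sketch, not GA3C's exact implementation.
    # Overwrites each experience's reward with its discounted n-step return.
    #   last_value: critic's value estimate for the last state (bootstrap target)
    #   done:       True if the last experience ended the episode
    if done:
        # Terminal step: nothing to bootstrap from, but its reward is kept
        # and the experience itself is returned so the agent can train on it.
        reward_sum = experiences[-1].reward
        kept = experiences
    else:
        # Episode cut mid-way: bootstrap from the critic and drop the last
        # experience, whose return cannot be computed yet.
        reward_sum = last_value
        kept = experiences[:-1]
    for t in range(len(experiences) - 2, -1, -1):
        reward_sum = experiences[t].reward + discount * reward_sum
        experiences[t].reward = reward_sum
    return kept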

tangbohu commented 6 years ago

Hi, William! I think you are right. But have you validated it by conducting some experiments?

wagnew3 commented 6 years ago

On my toy problem, which only has a nonzero reward on the terminal step, the agent cannot learn without this change. I haven't tested this on more complex problems like Atari games; I imagine the impact shouldn't be too large there, as those games usually have many rewards on non-terminal steps.

ifrosio commented 6 years ago

Hi, thanks for noticing this. Our implementation of A3C is indeed consistent with the original algorithm (see https://arxiv.org/pdf/1602.01783.pdf, page 14, where the reward is set to 0 for a terminal state). My intuition is that this is done because the expected value of the final state can only be zero (no rewards are expected in the future). Nonetheless, your fix should allow the algorithm to be used in games with only one, final reward. That said, I am not sure that A3C is the best algorithm for this case: you may have to dramatically change some of its hyper-parameters (t_max, for instance) to see convergence, and in any case I do not expect convergence to be fast. This also obviously depends on the length of the episode in your toy game.
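For reference, the n-step backup in the paper initializes the bootstrap return R to 0 at a terminal state but still folds the terminal step's reward into the returns of the preceding steps; roughly, in Python-like pseudocode (names are mine, not from the repo):

R = 0.0 if terminal else V(s_t)        # bootstrap only from a non-terminal state
for i in reversed(range(t_start, t)):
    R = r[i] + gamma * R               # the terminal step's reward enters the backup here
    # accumulate policy and value gradients using R as the target for step i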

wagnew3 commented 6 years ago

A3C correctly sets the value of terminal states to 0, but it keeps the reward those terminal states give ("R ← r_i + γR" in the A3C pseudocode). GA3C sets both the reward and the value of the terminal state to 0. In Pong, for example, where the terminal state also carries a reward of -1 or 1 (being scored on or scoring), this means no useful learning happens from the experiences of the last round of play.
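A concrete toy illustration of the difference, using a hypothetical three-step episode that ends in a loss (rewards [0, 0, -1], gamma = 0.99):

gamma = 0.99
rewards = [0.0, 0.0, -1.0]   # hypothetical episode ending in a loss

# Paper's rule: R starts at 0 for the terminal state, the terminal reward is kept.
R, returns = 0.0, []
for r in reversed(rewards):
    R = r + gamma * R
    returns.append(R)
print(returns[::-1])   # ~[-0.9801, -0.99, -1.0]: every step carries the loss signal

# GA3C as written: terminal reward zeroed and the last experience dropped.
R, returns = 0.0, []
for r in reversed(rewards[:-1]):
    R = r + gamma * R
    returns.append(R)
print(returns[::-1])   # [0.0, 0.0]: the -1 never reaches the learner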