[Closed] jkulhanek closed this issue 4 years ago
I think the overall design takes a different approach:

```python
prediction, value = self.predict(self.env.current_state)
action = self.select_action(prediction)
reward, done = self.env.step(action)
reward_sum += reward
exp = Experience(self.env.previous_state, action, prediction, reward, done)
```
The last line indicates that the experience is backward-looking: it stores the state the action was taken in (`previous_state`), while `prediction` and `value` were computed for that same state before the step. I assume that is why using a `terminal_reward` equal to `value` is consistent when `done` is False?
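To make the backward-looking convention concrete, here is a minimal, self-contained re-creation of the loop above. `ToyEnv`, the integer states, and the `Experience` fields are my own illustrative stand-ins, not the repo's actual classes:

```python
from collections import namedtuple

# Hypothetical minimal Experience mirroring the fields in the snippet above.
Experience = namedtuple("Experience", "state action prediction reward done")

class ToyEnv:
    """Toy environment with integer states; episode ends after 3 steps."""
    def __init__(self):
        self.current_state = 0
        self.previous_state = None

    def step(self, action):
        # Advance the state, remembering where the action was taken.
        self.previous_state = self.current_state
        self.current_state += 1
        done = self.current_state >= 3
        return 1.0, done  # (reward, done)

env = ToyEnv()
experiences = []
while True:
    reward, done = env.step(action=0)
    # Backward-looking: the experience records the state the action was
    # taken in (previous_state), not the state the step produced.
    experiences.append(Experience(env.previous_state, 0, None, reward, done))
    if done:
        break

print([e.state for e in experiences])
```

So each stored experience pairs a reward with the state that preceded it, which is what makes the question about which state's value to bootstrap from matter.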
I believe there is a bug in the A3C algorithm implementation, in the file `ProcessAgent.py` on line 107: the sub-episode return should be bootstrapped with the value of the next state, not the previous one.
I suggest replacing:
with:
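For reference, here is a sketch of the n-step return computation as I understand the proposed fix. The function name and variables are illustrative, not taken from the repo; the key point is that when the sub-episode is cut off (`done` is False), the recursion must be seeded with the critic's value for the state *after* the last experience:

```python
def discounted_returns(rewards, done, bootstrap_value, gamma=0.99):
    """Compute n-step discounted returns for a sub-episode.

    bootstrap_value must be V(s) for the state that follows the final
    reward; it is ignored (treated as 0) when the episode truly ended.
    """
    r = 0.0 if done else bootstrap_value
    returns = []
    # Walk the rewards backwards, accumulating the discounted sum.
    for reward in reversed(rewards):
        r = reward + gamma * r
        returns.append(r)
    returns.reverse()
    return returns

# With gamma=0.5 and a bootstrap of 10.0 after two rewards of 1.0:
print(discounted_returns([1.0, 1.0], done=False, bootstrap_value=10.0, gamma=0.5))
# -> [4.0, 6.0]
```

If the value of an earlier state were used as the seed instead, every return in the sub-episode would be biased by the difference between the two value estimates, which is the bug being described.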