NVlabs / GA3C

Hybrid CPU/GPU implementation of the A3C algorithm for deep reinforcement learning.
BSD 3-Clause "New" or "Revised" License

Wrong A3C implementation #40

Closed (jkulhanek closed this issue 4 years ago)

jkulhanek commented 5 years ago

I believe there is a bug in the A3C algorithm implementation, in the file "ProcessAgent.py" on line 107: the sub-episode return should be bootstrapped with the value of the next state, not the previous one.

I suggest replacing:

prediction, value = self.predict(self.env.current_state)

...
            if done or time_count == Config.TIME_MAX:
                terminal_reward = 0 if done else value

with:

prediction, value = self.predict(self.env.current_state)

...
            if done or time_count == Config.TIME_MAX:
                terminal_reward = 0
                if not done:
                    _, terminal_reward = self.predict(self.env.current_state)
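
For context, the disagreement is about which state the bootstrap value should come from when a rollout is cut short. Below is a minimal, self-contained sketch of an n-step return computation; the names (`n_step_returns`, `bootstrap_value`) are hypothetical and not from the repo, but the arithmetic shows where V(s_{t+n}) enters the target:

    import numpy as np

    def n_step_returns(rewards, bootstrap_value, gamma=0.99):
        # Discounted n-step returns for one rollout segment.
        # bootstrap_value is meant to be V(s_{t+n}): the critic's estimate
        # for the state reached *after* the last stored transition, or 0
        # if the episode terminated -- the point this issue raises.
        returns = np.zeros(len(rewards), dtype=np.float32)
        running = bootstrap_value
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns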
wgeul commented 4 years ago

I think the overall design takes a different approach:

            prediction, value = self.predict(self.env.current_state)
            action = self.select_action(prediction)
            reward, done = self.env.step(action)
            reward_sum += reward
            exp = Experience(self.env.previous_state, action, prediction, reward, done)

The last line indicates that the stored experience is backward-looking: it records previous_state, the state the action was taken from. I assume that's why setting terminal_reward equal to value is still consistent when done is False?
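
To make the timing concrete, here is one loop iteration with per-step annotations; the comments are my reading of the snippet above, not something stated in the repo:

    # Before the step, current_state is s_t.
    prediction, value = self.predict(self.env.current_state)  # value == V(s_t)
    action = self.select_action(prediction)
    reward, done = self.env.step(action)                      # env advances to s_{t+1}
    # After the step, previous_state == s_t and current_state == s_{t+1},
    # so the Experience below is anchored at s_t (backward-looking).
    exp = Experience(self.env.previous_state, action, prediction, reward, done)

On that reading, when the rollout is cut at TIME_MAX, `value` is V(s) for the state stored in the *last* experience. If that final experience is dropped before training (as the repo's reward-accumulation step appears to do), then using `value` as terminal_reward effectively bootstraps the preceding transition with the value of its next state, which would reconcile the two views.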