[Closed] jkulhanek closed this issue 4 years ago
I think the overall design takes a different approach:

```python
prediction, value = self.predict(self.env.current_state)
action = self.select_action(prediction)
reward, done = self.env.step(action)
reward_sum += reward
exp = Experience(self.env.previous_state, action, prediction, reward, done)
```
The last line indicates that the experience is backward-looking: it stores the state the action was taken in (`previous_state`), while `prediction` and `value` were computed for that same state before the step. I assume that is why using a `terminal_reward` equal to `value` is consistent when `done` is False?
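To make the backward-looking convention concrete, here is a minimal, self-contained re-creation of the loop above. `ToyEnv`, the integer states, and the `Experience` fields are my own illustrative stand-ins, not the repo's actual classes:

```python
from collections import namedtuple

# Hypothetical minimal Experience mirroring the fields in the snippet above.
Experience = namedtuple("Experience", "state action prediction reward done")

class ToyEnv:
    """Toy environment with integer states; episode ends after 3 steps."""
    def __init__(self):
        self.current_state = 0
        self.previous_state = None

    def step(self, action):
        # Advance the state, remembering where the action was taken.
        self.previous_state = self.current_state
        self.current_state += 1
        done = self.current_state >= 3
        return 1.0, done  # (reward, done)

env = ToyEnv()
experiences = []
while True:
    reward, done = env.step(action=0)
    # Backward-looking: the experience records the state the action was
    # taken in (previous_state), not the state the step produced.
    experiences.append(Experience(env.previous_state, 0, None, reward, done))
    if done:
        break

print([e.state for e in experiences])
```

So each stored experience pairs a reward with the state that preceded it, which is what makes the question about which state's value to bootstrap from matter.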
I believe there is a bug in the A3C algorithm implementation, in the file `ProcessAgent.py` on line 107: the sub-episode return should be bootstrapped with the value of the next state, not the previous one.
I suggest replacing:
with:
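For reference, here is a sketch of the n-step return computation as I understand the proposed fix. The function name and variables are illustrative, not taken from the repo; the key point is that when the sub-episode is cut off (`done` is False), the recursion must be seeded with the critic's value for the state *after* the last experience:

```python
def discounted_returns(rewards, done, bootstrap_value, gamma=0.99):
    """Compute n-step discounted returns for a sub-episode.

    bootstrap_value must be V(s) for the state that follows the final
    reward; it is ignored (treated as 0) when the episode truly ended.
    """
    r = 0.0 if done else bootstrap_value
    returns = []
    # Walk the rewards backwards, accumulating the discounted sum.
    for reward in reversed(rewards):
        r = reward + gamma * r
        returns.append(r)
    returns.reverse()
    return returns

# With gamma=0.5 and a bootstrap of 10.0 after two rewards of 1.0:
print(discounted_returns([1.0, 1.0], done=False, bootstrap_value=10.0, gamma=0.5))
# -> [4.0, 6.0]
```

If the value of an earlier state were used as the seed instead, every return in the sub-episode would be biased by the difference between the two value estimates, which is the bug being described.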