It seems incorrect when I compare against the [PER paper](https://arxiv.org/pdf/1511.05952.pdf):

Algorithm 1, line 11 (TD error):

delta_j = R_j + gamma_j * Q_target(S_j, argmax_a Q(S_j, a)) - Q(S_{j-1}, A_{j-1})
If I am not mistaken, the j-1 subscript refers to the current state in the implementation, i.e. state, action, reward, and done all refer to j-1, and new_state refers to j.
Line 125 in ddqn.py then takes the arg max over the current state rather than the next one:
```python
q_val = self.agent.predict(state)
next_best_action = np.argmax(q_val)
```
should be
```python
q_val = self.agent.predict(new_state)
next_best_action = np.argmax(q_val)
```
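To make the indexing explicit, here is a minimal, self-contained sketch of the full TD-error computation from Algorithm 1, line 11, using the corrected arg max over new_state. The names `online_predict` and `target_predict` are placeholders for the online and target network forward passes, not the actual methods in ddqn.py; only the indexing is the point.

```python
import numpy as np

def td_error(online_predict, target_predict,
             state, action, reward, new_state, done, gamma=0.99):
    """Double-DQN TD error as in PER Algorithm 1, line 11:
    delta = R_j + gamma_j * Q_target(S_j, argmax_a Q(S_j, a)) - Q(S_{j-1}, A_{j-1})
    where (state, action) correspond to j-1 and new_state to j.
    """
    # Arg max is taken over the *next* state, using the online network
    q_next = online_predict(new_state)               # Q(S_j, .)
    next_best_action = np.argmax(q_next)

    # The selected action is evaluated with the target network
    q_target_next = target_predict(new_state)        # Q_target(S_j, .)
    target = reward + (0.0 if done else gamma * q_target_next[next_best_action])

    # Current estimate Q(S_{j-1}, A_{j-1}) from the online network
    q_current = online_predict(state)[action]
    return target - q_current
```

With `predict(state)` instead of `predict(new_state)`, the arg max would be computed for S_{j-1} rather than S_j, which is the mismatch described above.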