PacktPublishing / Deep-Reinforcement-Learning-Hands-On

Hands-on Deep Reinforcement Learning, published by Packt

calculation of Q in chapter 6 pong #14

Closed: autohandle closed this issue 5 years ago

autohandle commented 5 years ago

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/2e171a12b347bdefc72a77b11753975198b7c8d1/Chapter06/02_dqn_pong.py#L88-L103

In Agent.play_step, an Experience is created that records the current state (observation), the action taken, the transition reward, and the next state (observation).
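
For reference, the Experience record is a plain namedtuple, roughly as sketched below (reconstructed from the linked file, so field names may differ slightly; the actual definition also stores a done flag marking the end of an episode):

```python
import collections

# Sketch of the transition record that play_step pushes into the replay buffer
# (reconstructed from the linked file; field names are assumptions, not the verbatim listing).
Experience = collections.namedtuple(
    'Experience', field_names=['state', 'action', 'reward', 'done', 'new_state'])

# Example: one step of interaction stored as a single record.
# exp = Experience(state=obs, action=action, reward=reward, done=is_done, new_state=new_obs)
```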

In calc_loss, the batch of Experiences is unpacked and converted to tensors. The next states (observations) are then passed to the target network (tgt_net) to get predicted scores for all actions in the next state, and the maximum predicted score for each next state is taken (next_state_values).
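
That part of calc_loss looks roughly like the sketch below (the helper name bellman_targets and the tensor-conversion details are illustrative reconstructions of the linked lines, not the book's exact code):

```python
import torch

# Hypothetical helper (the name bellman_targets is not from the book);
# a sketch of how calc_loss derives the training target from the target network.
def bellman_targets(batch, tgt_net, gamma, device="cpu"):
    states, actions, rewards, dones, next_states = batch

    # Convert the unpacked arrays to tensors on the chosen device.
    next_states_v = torch.tensor(next_states).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.tensor(dones, dtype=torch.bool).to(device)

    # Target-net Q-values for every action in the next state,
    # then the maximum over actions (next_state_values).
    next_state_values = tgt_net(next_states_v).max(1)[0]
    # Episodes that ended have no future return.
    next_state_values[done_mask] = 0.0
    # Do not backpropagate through the target network's values.
    next_state_values = next_state_values.detach()

    # Bellman target: r + gamma * max_a' Q_tgt(s', a')
    return next_state_values * gamma + rewards_v
```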

Now, here is where I'm confused: instead of calculating Q for the current state as (reward at the current state) + gamma (discount) * (max Q at the next state), it looks to me like line 102 calculates (max Q at the next state) + gamma (discount) * (reward at the current state).

Shmuma commented 5 years ago

Hi!

Sorry for the late reply.

> Now, here is where I'm confused: instead of calculating Q for the current state as (reward at the current state) + gamma (discount) * (max Q at the next state), it looks to me like line 102 calculates (max Q at the next state) + gamma (discount) * (reward at the current state).

Please check line 102 carefully (as you quoted it): `expected_state_action_values = next_state_values * GAMMA + rewards_v`

The multiplication is between GAMMA and next_state_values, and then the reward is added. So the result complies with the Bellman equation: Q(s, a) = r + gamma * max_a' Q(s', a').
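
Since multiplication binds tighter than addition, the expression on line 102 is exactly that Bellman target. A quick check with made-up numbers:

```python
GAMMA = 0.99

# Toy scalars chosen only to illustrate operator precedence, not real network outputs.
next_state_values = 2.0   # stands in for max_a' Q_tgt(s', a')
rewards_v = 1.0           # stands in for the step reward r

lhs = next_state_values * GAMMA + rewards_v   # the expression as written on line 102
rhs = rewards_v + GAMMA * next_state_values   # the Bellman target written the usual way
assert lhs == rhs
```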