PacktPublishing / Deep-Reinforcement-Learning-Hands-On

Hands-on Deep Reinforcement Learning, published by Packt

calculation of Q in chapter 6 pong #14

Closed: autohandle closed this issue 5 years ago

autohandle commented 5 years ago

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/2e171a12b347bdefc72a77b11753975198b7c8d1/Chapter06/02_dqn_pong.py#L88-L103

In Agent.play_step, an Experience is created that records the current state (observation), the action taken, the transition reward, and the next state (observation).
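
For reference, the Experience record is a plain namedtuple, roughly as sketched below (reconstructed from the linked file, so field names may differ slightly; the actual definition also stores a done flag marking the end of an episode):

```python
import collections

# Sketch of the transition record that play_step pushes into the replay buffer
# (reconstructed from the linked file; field names are assumptions, not the verbatim listing).
Experience = collections.namedtuple(
    'Experience', field_names=['state', 'action', 'reward', 'done', 'new_state'])

# Example: one step of interaction stored as a single record.
# exp = Experience(state=obs, action=action, reward=reward, done=is_done, new_state=new_obs)
```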

In calc_loss, the batch of Experiences is unpacked and converted to tensors. The next states (observations) are then passed to the target network (tgt_net) to get predicted scores for all actions in the next state, and the maximum predicted score for each next state is taken (next_state_values).
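
That part of calc_loss looks roughly like the sketch below (the helper name bellman_targets and the tensor-conversion details are illustrative reconstructions of the linked lines, not the book's exact code):

```python
import torch

# Hypothetical helper (the name bellman_targets is not from the book);
# a sketch of how calc_loss derives the training target from the target network.
def bellman_targets(batch, tgt_net, gamma, device="cpu"):
    states, actions, rewards, dones, next_states = batch

    # Convert the unpacked arrays to tensors on the chosen device.
    next_states_v = torch.tensor(next_states).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.tensor(dones, dtype=torch.bool).to(device)

    # Target-net Q-values for every action in the next state,
    # then the maximum over actions (next_state_values).
    next_state_values = tgt_net(next_states_v).max(1)[0]
    # Episodes that ended have no future return.
    next_state_values[done_mask] = 0.0
    # Do not backpropagate through the target network's values.
    next_state_values = next_state_values.detach()

    # Bellman target: r + gamma * max_a' Q_tgt(s', a')
    return next_state_values * gamma + rewards_v
```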

Now, here is where I'm confused: instead of calculating Q for the current state as (reward at the current state) + gamma (discount) * (max Q at the next state), it looks to me like line 102 calculates (max Q at the next state) + gamma (discount) * (reward at the current state).

Shmuma commented 5 years ago

Hi!

Sorry for the late reply.

> Now, here is where I'm confused: instead of calculating Q for the current state as (reward at the current state) + gamma (discount) * (max Q at the next state), it looks to me like line 102 calculates (max Q at the next state) + gamma (discount) * (reward at the current state).

Please check line 102 carefully (as you quoted it): `expected_state_action_values = next_state_values * GAMMA + rewards_v`

The multiplication is between GAMMA and next_state_values, and then the reward is added. So the result complies with the Bellman equation: Q(s, a) = r + gamma * max_a' Q(s', a').
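
Since multiplication binds tighter than addition, the expression on line 102 is exactly that Bellman target. A quick check with made-up numbers:

```python
GAMMA = 0.99

# Toy scalars chosen only to illustrate operator precedence, not real network outputs.
next_state_values = 2.0   # stands in for max_a' Q_tgt(s', a')
rewards_v = 1.0           # stands in for the step reward r

lhs = next_state_values * GAMMA + rewards_v   # the expression as written on line 102
rhs = rewards_v + GAMMA * next_state_values   # the Bellman target written the usual way
assert lhs == rhs
```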