PacktPublishing / Deep-Reinforcement-Learning-Hands-On-Second-Edition

Deep-Reinforcement-Learning-Hands-On-Second-Edition, published by Packt
MIT License

pong_A2C might backpropagate policy loss into the value net #60

Open aminzakizebarjad opened 2 years ago

aminzakizebarjad commented 2 years ago

While reviewing this code, I noticed that the policy loss might affect the value branch of the network. At line 143, `adv_v = vals_ref_v - value_v.detach()` computes the advantage; `value_v` is detached there to prevent the policy loss from backpropagating into the value net. However, `vals_ref_v` is produced by the `unpack_batch` function, and at line 90, `last_vals_v = net(last_states_v)[1]`, the value net is involved in computing `vals_ref_v`. So a gradient path from the policy loss back into the value net may still exist through `vals_ref_v`. As a result, I think line 143 should be changed from `adv_v = vals_ref_v - value_v.detach()` to `adv_v = (vals_ref_v - value_v).detach()`.
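A minimal PyTorch sketch of the concern (not the book's exact code; the linear layer, shapes, and discount factor here are illustrative assumptions). It shows that detaching only `value_v` still leaves `adv_v` with gradient history if `vals_ref_v` was built from a live value-net output, whereas detaching the whole difference severs the path:

```python
import torch

torch.manual_seed(0)

value_net = torch.nn.Linear(4, 1)       # stand-in for the value head
states = torch.randn(8, 4)
last_states = torch.randn(8, 4)
rewards = torch.randn(8, 1)

value_v = value_net(states)             # V(s) for the batch
# If vals_ref_v is built from a live tensor (as via net(last_states_v)[1]),
# it carries autograd history back to the value net:
vals_ref_v = rewards + 0.99 * value_net(last_states)

# Variant 1: only value_v is detached. vals_ref_v still requires grad,
# so a policy loss scaled by adv_leaky would backprop into the value net.
adv_leaky = vals_ref_v - value_v.detach()
print(adv_leaky.requires_grad)          # True: a gradient path remains

# Variant 2: detach the entire advantage. No path back to the value net.
adv_safe = (vals_ref_v - value_v).detach()
print(adv_safe.requires_grad)           # False
```

Note that if `unpack_batch` converts `last_vals_v` to a NumPy array before building the reference values, the graph is already cut at that point and variant 1 would be safe; the extra detach is then harmless but redundant.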