PacktPublishing / Deep-Reinforcement-Learning-Hands-On-Second-Edition

Deep-Reinforcement-Learning-Hands-On-Second-Edition, published by Packt
MIT License

pong_A2C might backpropagate policy loss into the value net #60

Open aminzakizebarjad opened 2 years ago

aminzakizebarjad commented 2 years ago

While reviewing this code, I noticed that the policy loss might affect the value branch of the network. At line 143, `adv_v = vals_ref_v - value_v.detach()` computes the advantage; `value_v` is detached there to prevent the policy loss from backpropagating into the value net. However, `vals_ref_v` is produced by the `unpack_batch` function, and at line 90, `last_vals_v = net(last_states_v)[1]`, the value net is involved in computing `vals_ref_v`. So a gradient path from the policy loss back into the value net may still exist through `vals_ref_v`. As a result, I think line 143 should be changed from `adv_v = vals_ref_v - value_v.detach()` to `adv_v = (vals_ref_v - value_v).detach()`.
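A minimal PyTorch sketch of the concern (not the book's exact code; the linear layer, shapes, and discount factor here are illustrative assumptions). It shows that detaching only `value_v` still leaves `adv_v` with gradient history if `vals_ref_v` was built from a live value-net output, whereas detaching the whole difference severs the path:

```python
import torch

torch.manual_seed(0)

value_net = torch.nn.Linear(4, 1)       # stand-in for the value head
states = torch.randn(8, 4)
last_states = torch.randn(8, 4)
rewards = torch.randn(8, 1)

value_v = value_net(states)             # V(s) for the batch
# If vals_ref_v is built from a live tensor (as via net(last_states_v)[1]),
# it carries autograd history back to the value net:
vals_ref_v = rewards + 0.99 * value_net(last_states)

# Variant 1: only value_v is detached. vals_ref_v still requires grad,
# so a policy loss scaled by adv_leaky would backprop into the value net.
adv_leaky = vals_ref_v - value_v.detach()
print(adv_leaky.requires_grad)          # True: a gradient path remains

# Variant 2: detach the entire advantage. No path back to the value net.
adv_safe = (vals_ref_v - value_v).detach()
print(adv_safe.requires_grad)           # False
```

Note that if `unpack_batch` converts `last_vals_v` to a NumPy array before building the reference values, the graph is already cut at that point and variant 1 would be safe; the extra detach is then harmless but redundant.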