While reviewing this code, it occurred to me that the policy loss might affect the value branch of the network.
Take a look at line 143, adv_v = vals_ref_v - value_v.detach(), which computes the advantage. Here value_v is detached so that the policy loss does not affect the value net during the backward pass. However, vals_ref_v is computed by the function unpack_batch, and at line 90, last_vals_v = net(last_states_v)[1], the value net is involved in computing vals_ref_v.
As a result, I think line 143 should be changed from adv_v = vals_ref_v - value_v.detach() to adv_v = (vals_ref_v - value_v).detach(), so that no gradient from the policy loss can reach the value net through vals_ref_v either.
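
To make the concern concrete, here is a minimal sketch, assuming vals_ref_v is still attached to the autograd graph when it reaches line 143. The TinyA2C network, the tensor shapes, and the 0.99 discount are hypothetical stand-ins for illustration and are not taken from the repository. It compares the gradient that the policy loss alone leaves on the value head under both ways of detaching.

```python
import torch
import torch.nn as nn


class TinyA2C(nn.Module):
    """Hypothetical two-headed net: forward returns (policy logits, value)."""

    def __init__(self, obs_size=4, n_actions=2):
        super().__init__()
        self.policy = nn.Linear(obs_size, n_actions)
        self.value = nn.Linear(obs_size, 1)

    def forward(self, x):
        return self.policy(x), self.value(x)


def value_grad_from_policy_loss(detach_whole_advantage):
    """Backpropagate only the policy loss and return the value head's weight grad."""
    torch.manual_seed(0)
    net = TinyA2C()
    states_v = torch.randn(8, 4)
    last_states_v = torch.randn(8, 4)
    actions_t = torch.randint(0, 2, (8,))
    rewards_v = torch.randn(8)

    logits_v, value_v = net(states_v)
    value_v = value_v.squeeze(-1)

    # Assumption: the bootstrapped target stays a tensor in the graph.
    last_vals_v = net(last_states_v)[1].squeeze(-1)
    vals_ref_v = rewards_v + 0.99 * last_vals_v

    if detach_whole_advantage:
        adv_v = (vals_ref_v - value_v).detach()
    else:
        adv_v = vals_ref_v - value_v.detach()

    log_prob_v = torch.log_softmax(logits_v, dim=1)
    log_prob_actions_v = log_prob_v.gather(1, actions_t.unsqueeze(1)).squeeze(1)
    loss_policy_v = -(adv_v * log_prob_actions_v).mean()
    loss_policy_v.backward()
    return net.value.weight.grad


if __name__ == "__main__":
    print("value_v.detach() only:", value_grad_from_policy_loss(False))
    print("whole advantage detached:", value_grad_from_policy_loss(True))
```

Under that assumption, the first variant leaves a gradient on the value head and the second leaves none. If unpack_batch already cuts the graph before building vals_ref_v (for example by converting last_vals_v to a NumPy array), then vals_ref_v.requires_grad would be False and the two forms behave the same, so checking that attribute at line 143 is a quick way to tell which case applies.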