I found two problems with the Double DQN code:

1. The TD target is computed without `detach()`, so gradients also flow through the target. This is equivalent to not using a semi-gradient method.
2. During debugging, I found that the outputs of `policy_net` and `target_net` are always the same. This is because they share the same network, so `target_net` changes immediately whenever `policy_net` changes.
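
To make both points concrete, here is a minimal PyTorch sketch of what a fix could look like. It is not the repository's code: the `make_net` helper, the layer sizes, the fake batch, and the hyperparameters are all assumptions made for illustration. It only shows the two ideas, detaching the TD target and keeping `target_net` as an independent copy of `policy_net`.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)


def make_net(n_states=4, n_actions=2):
    # Hypothetical tiny Q-network; the sizes are illustrative only.
    return nn.Sequential(nn.Linear(n_states, 16), nn.ReLU(), nn.Linear(16, n_actions))


policy_net = make_net()
# Independent copy. A plain `target_net = policy_net` would alias the same
# object, so both names would always produce identical outputs.
target_net = copy.deepcopy(policy_net)
target_net.eval()

optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.1)
gamma = 0.99

# Fake batch of transitions (shapes are assumptions).
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8, 1))
rewards = torch.randn(8)
next_states = torch.randn(8, 4)
dones = torch.zeros(8)

q_values = policy_net(states).gather(1, actions).squeeze(1)

# Double DQN: policy_net selects the next action, target_net evaluates it.
next_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
# detach() treats the TD target as a constant (semi-gradient update).
td_target = (rewards + gamma * next_q * (1 - dones)).detach()

loss = nn.functional.mse_loss(q_values, td_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After one update the outputs differ, because target_net is a separate copy.
outputs_differ = not torch.allclose(policy_net(states), target_net(states))

# Periodic hard sync; how often to do this is a design choice, not taken from the PR.
target_net.load_state_dict(policy_net.state_dict())
```

Wrapping the target computation in `torch.no_grad()` would have the same effect as `detach()` here. Either way, no gradient reaches the parameters of `target_net`, and `policy_net` is updated only through `q_values`.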
Thanks for your PR, I will consider your suggestions. However, we are currently moving all algorithms to a new template, so I cannot merge your PR right now. I will acknowledge you when I update Double DQN.