Open marioyc opened 4 years ago
for DQN it's only Q_targets_next:
but for IQN you are right :)
Oh, I didn't notice that. It seems to contradict equation 2, and it would also change the logsumexp calculations, since those assume the Q-values are divided by entropy_tau.
Confirmed with the author that it is a typo: the values should be divided by entropy_tau.
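For anyone following along, here is a small framework-free sketch of why the division matters. It is not the repo's code: the Q-values and the entropy_tau value are made up for illustration, and the helper names are hypothetical. It only checks that the policy from softmax(q / entropy_tau) agrees with the log-policy obtained from the logsumexp calculation, while softmax(q) without the division does not.

```python
import math

def softmax(values, tau=1.0):
    """Softmax of values / tau, stabilized by subtracting the max."""
    scaled = [v / tau for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(values, tau=1.0):
    """log softmax(values / tau), computed via a stable logsumexp."""
    scaled = [v / tau for v in values]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]

# Illustrative numbers only (assumptions, not taken from the thread).
q_next = [1.0, 2.0, 0.5]   # Q-values of one next state
entropy_tau = 0.03         # assumed temperature

log_pi = log_softmax(q_next, entropy_tau)    # logsumexp side, uses q / tau
pi_scaled = softmax(q_next, entropy_tau)     # softmax(q / tau)
pi_unscaled = softmax(q_next)                # softmax(q), the suspected typo

# With the division, the policy matches exp(log-policy).
consistent = all(
    math.isclose(p, math.exp(lp), abs_tol=1e-12)
    for p, lp in zip(pi_scaled, log_pi)
)
# Without it, the two disagree noticeably.
mismatch = max(abs(p - math.exp(lp)) for p, lp in zip(pi_unscaled, log_pi))

print("softmax(q / tau) consistent with log-policy:", consistent)
print("largest gap when using softmax(q):", mismatch)
```

In PyTorch terms, this would correspond to using F.softmax(Q_targets_next / entropy_tau, dim=1) alongside a logsumexp over Q_targets_next / entropy_tau.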
Also there is a TF implementation here: https://github.com/google-research/google-research/tree/master/munchausen_rl
@marioyc Thank you! I'll fix it :)
Should F.softmax(Q_targets_next, dim=1) be F.softmax(Q_targets_next / entropy_tau, dim=1) instead?