wechto opened this issue 5 years ago
Hello, I am a student who is just beginning to learn RL, and I have run the VALOR code examples. I have a question about the reward, which currently comes from the external environment. I think the reward should instead come from equation (2) of https://arxiv.org/pdf/1807.10299.pdf. Following the same approach as VIC and DIAYN, I would expect the reward to take a form along the lines of

r_t = log q(z|s_t) - log p(z)

So my question is: why does the code not use this intrinsic reward? Thank you in advance.
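For comparison, here is a minimal sketch of the per-step VIC/DIAYN-style intrinsic reward described above (the `discriminator` module, shapes, and names are hypothetical, not from the repo; with a uniform prior over `num_skills` contexts, log p(z) = -log(num_skills)):

```python
import torch
import torch.nn.functional as F

def diayn_style_reward(discriminator, states, z_idx, num_skills):
    """Per-step intrinsic reward r_t = log q(z | s_t) - log p(z).

    discriminator: module mapping states to skill logits (hypothetical)
    states:        tensor of shape (T, state_dim) for one trajectory
    z_idx:         index of the sampled context/skill z
    """
    logits = discriminator(states)                       # (T, num_skills)
    log_q = F.log_softmax(logits, dim=-1)[:, z_idx]      # log q(z | s_t), shape (T,)
    log_p = -torch.log(torch.tensor(float(num_skills)))  # uniform prior log p(z)
    return log_q - log_p                                 # shape (T,)
```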
Sorry for the delay. The intrinsic reward is added to the policy loss through

`pi_loss = -(logp*(k*adv+pos)).mean()`

where the `pos` term represents log q(z|\tau).

@O151 The reward in VALOR is also not the same as in VIC. VALOR computes the log probability of the context conditioned on the whole trajectory and adds it evenly as an intrinsic reward at every step. The -log p(z) term is a constant since the prior is uniform, so it is not necessary in the reward.
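For concreteness, a minimal sketch of how that loss might be computed (the function name, argument names, and the even per-step split are assumptions based on the description above, not the repo's exact code):

```python
import torch

def valor_pi_loss(logp, adv, log_q_z_tau, k=1.0):
    """Policy-gradient loss with VALOR's intrinsic term.

    logp:        log pi(a_t | s_t, z) per step, shape (T,)
    adv:         extrinsic advantages, shape (T,)
    log_q_z_tau: scalar log q(z | tau) from the trajectory decoder
    k:           weight on the extrinsic advantage (assumed scaling factor)
    """
    T = logp.shape[0]
    # Spread the trajectory-level log q(z|tau) evenly over every step.
    pos = (log_q_z_tau / T) * torch.ones_like(logp)
    # Note: -log p(z) is omitted; with a uniform prior it is a constant.
    return -(logp * (k * adv + pos)).mean()
```

Minimizing this loss pushes the policy both toward high extrinsic return and toward trajectories the decoder can identify.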