Steven-Ho / VALOR

Implementation of VALOR (Variational Option Discovery Algorithms)

Reward - intrinsic reward or extrinsic reward? #1

Open wechto opened 5 years ago

wechto commented 5 years ago

Hello, I am a student just beginning to learn RL, and I have run the examples in the VALOR code.

I have a question about the reward, which is currently taken from the extrinsic environment. I think the reward should instead come from equation (2) of https://arxiv.org/pdf/1807.10299.pdf. Following the same approach as VIC and DIAYN, I would expect the reward to be:

$r_t = \log q_{\phi}(z \mid s_t) - \log p(z)$
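Concretely, this is the kind of per-step reward I had in mind (just a rough sketch to illustrate the question, not code from this repo; `discriminator` and `n_skills` are placeholder names):

```python
import torch
import torch.nn.functional as F

def vic_style_reward(discriminator, state, z, n_skills):
    """Per-step reward r_t = log q_phi(z|s_t) - log p(z), uniform prior over skills."""
    logits = discriminator(state.unsqueeze(0))         # (1, n_skills)
    log_q_z = F.log_softmax(logits, dim=-1)[0, z]      # log q_phi(z | s_t)
    log_p_z = torch.log(torch.tensor(1.0 / n_skills))  # constant for uniform p(z)
    return (log_q_z - log_p_z).item()
```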

So my question is: why does the code not use this per-step intrinsic reward?

Thank you in advance.

Steven-Ho commented 5 years ago

Sorry for the delay. The intrinsic reward is added to the policy loss through `pi_loss = -(logp*(k*adv+pos)).mean()`. The `pos` term represents $\log q(z \mid \tau)$.
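In other words, the trajectory-level term is folded directly into the policy-gradient objective rather than into the per-step environment reward. A stripped-down sketch of that loss (variable names taken from the snippet above; the shapes and the role of `k` as a weight on the extrinsic advantage are my reading, not guaranteed to match the repo exactly):

```python
import torch

def pi_loss(logp, adv, pos, k=1.0):
    # logp: log-probs of the taken actions, shape (N,)
    # adv:  extrinsic advantages, shape (N,)
    # pos:  log q(z | tau) broadcast to every step of its trajectory, shape (N,)
    # k:    weight on the extrinsic advantage
    return -(logp * (k * adv + pos)).mean()
```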

Steven-Ho commented 5 years ago

@O151 Also, the reward in VALOR is not the same as in VIC. VALOR computes the log probability conditioned on the whole trajectory, $\log q(z \mid \tau)$, and spreads it evenly as an intrinsic reward over every step. The $-\log p(z)$ term is a constant since the prior is uniform, so it is not necessary to include it in the reward.
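To make the "spreads it evenly over every step" part concrete, here is an illustrative toy sketch (not the repo's exact code):

```python
import numpy as np

def per_step_bonus(log_q_z_given_tau, T):
    # The decoder scores the whole trajectory once; that single log q(z|tau)
    # is then split evenly over the T steps as the intrinsic bonus.
    return np.full(T, log_q_z_given_tau / T)

# With a uniform prior over K contexts, -log p(z) = log K is the same
# constant at every step, so leaving it out only shifts returns by a constant.
```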