spacegoing closed 4 years ago
Why is $V(S_t) - Q(O_t, S_t)$ an advantage? Even $Q(O_t, S_t) - V(S_t)$ is not an advantage. For the low-MDP, the state is $(S_t, O_t)$, and $r_{t+1} + \gamma U(O_t, S_{t+1})$ is a sample of $q(S_t, O_t, A_t)$.
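To make that concrete, here is a minimal sketch in plain Python with made-up numbers. It assumes the usual option-critic definition $U(O_t, S_{t+1}) = (1 - \beta) Q(O_t, S_{t+1}) + \beta V(S_{t+1})$ and, purely for illustration, greedy option selection so that $V(s) = \max_o Q(o, s)$. The function and variable names are hypothetical and are not taken from `OptionCritic_agent.py`.

```python
# Illustrative sketch only: hypothetical names, made-up numbers.

def value_upon_arrival(q_next, option, beta, v_next):
    """U(O_t, S_{t+1}): keep the option with prob. 1 - beta, otherwise re-select."""
    return (1.0 - beta) * q_next[option] + beta * v_next


def intra_option_advantage(reward, gamma, u_next, q_current):
    """One-step sample of q(S_t, O_t, A_t) minus the baseline Q(O_t, S_t)."""
    return reward + gamma * u_next - q_current


q_t = [1.0, 3.0]      # Q(o, S_t) for two options
q_next = [1.0, 2.0]   # Q(o, S_{t+1}) for the same two options
option = 0            # O_t, the option being executed
beta = 0.5            # termination probability of O_t in S_{t+1}
reward, gamma = 1.0, 0.5

# Assumption: greedy option selection, so V(s) = max_o Q(o, s).
v_t = max(q_t)
v_next = max(q_next)

u_next = value_upon_arrival(q_next, option, beta, v_next)
adv = intra_option_advantage(reward, gamma, u_next, q_t[option])

# The quantity proposed in the question, for comparison.
v_minus_q = v_t - q_t[option]
```

Note that `v_minus_q` depends only on $S_t$ and $O_t$: it stays the same whatever action was taken and whatever reward followed, whereas `adv` changes with the sampled transition.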
Hi Shangtong,
Many thanks for pointing those out. It is much clearer now, and it looks like I was quite confused earlier :P
Have a great day!
Hi Shangtong,
I have a question about this line:
https://github.com/ShangtongZhang/DeepRL/blob/64145c3ae755dbc47bc6b902114600ccd43c808c/deep_rl/agent/OptionCritic_agent.py#L97
Why are you using $r_{t+1} + \gamma U(O_t, S_{t+1}) - Q_{\omega}(O_t, S_t)$ to calculate the advantage instead of using
`adv = v - storage.q[i].gather(1, storage.option[i])`
where `v` is the variable defined on line 101? That would be $V(S_t) - Q(O_t, S_t)$, the "regular" advantage function for an MDP.

Many thanks!