ShangtongZhang / DeepRL

Modularized Implementation of Deep RL Algorithms in PyTorch

Option Critic e-greedy option update question #87

Closed · spacegoing closed this issue 4 years ago

spacegoing commented 4 years ago

Hi Shangtong,

I would like to know in this line:

https://github.com/ShangtongZhang/DeepRL/blob/64145c3ae755dbc47bc6b902114600ccd43c808c/deep_rl/agent/OptionCritic_agent.py#L97

Why are you using $r_{t+1} + \gamma U(O_t, S_{t+1}) - Q_{\omega}(O_t, S_t)$ to compute the advantage, instead of adv = v - storage.q[i].gather(1, storage.option[i]), where v is the variable defined on line 101? That would be $V(S_t) - Q(O_t, S_t)$, the "regular" advantage function for an MDP.

Many thanks!

ShangtongZhang commented 4 years ago

Why is $V(S_t) - Q(O_t, S_t)$ an advantage? Even $Q(O_t, S_t) - V(S_t)$ is not an advantage. For the low-MDP, the state is $(S_t, O_t)$, and $r_{t+1} + \gamma U(O_t, S_{t+1})$ is a sample of $q(S_t, O_t, A_t)$.
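
For concreteness, here is a minimal sketch (not the repository's exact code) of the target and advantage described above, for a single transition; all names (`option_value_target`, `q_next`, `beta_next`, etc.) are illustrative, and the shapes/inputs are assumptions made for the example:

```python
import torch

def option_value_target(r, gamma, q_next, beta_next, q_current, option, done):
    """Sketch of the option-value target U and the advantage used at the cited line.

    Assumed (illustrative) inputs for one transition:
      r         -- reward r_{t+1}
      gamma     -- discount factor
      q_next    -- Q_omega(o, S_{t+1}) for all options, shape [num_options]
      beta_next -- termination probability beta(O_t, S_{t+1})
      q_current -- Q_omega(O_t, S_t) (scalar tensor)
      option    -- index of O_t
      done      -- 1.0 if S_{t+1} is terminal, else 0.0
    """
    # U(O_t, S_{t+1}) = (1 - beta) * Q(O_t, S_{t+1}) + beta * max_o Q(o, S_{t+1})
    u_next = (1 - beta_next) * q_next[option] + beta_next * q_next.max()
    # In the augmented ("low") MDP whose state is (S_t, O_t),
    # r_{t+1} + gamma * U(O_t, S_{t+1}) is a sample of q(S_t, O_t, A_t).
    target = r + gamma * (1 - done) * u_next
    # Advantage for the intra-option policy gradient: the q sample minus Q(O_t, S_t).
    adv = target - q_current
    return target, adv

# Example with made-up numbers:
q_next = torch.tensor([0.5, 1.2, 0.3])
target, adv = option_value_target(r=1.0, gamma=0.99, q_next=q_next,
                                  beta_next=torch.tensor(0.2),
                                  q_current=torch.tensor(0.9),
                                  option=1, done=0.0)
```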

spacegoing commented 4 years ago

Hi Shangtong,

Many thanks for pointing that out. It is much clearer now; it looks like I was quite confused earlier :P

Have a great day!