ShangtongZhang / DeepRL

Modularized Implementation of Deep RL Algorithms in PyTorch
MIT License

Option-Critic Beta Advantage Question #73

Closed: spacegoing closed this issue 4 years ago

spacegoing commented 4 years ago

Hi Shangtong,

https://github.com/ShangtongZhang/DeepRL/blob/717fe68e7ed00a80c6c52ec9613c9a16dbb37e0c/deep_rl/agent/OptionCritic_agent.py#L101

I would love to know how the value function is calculated here. Why isn't it simply V = max_w Q(s, w) rather than an expectation over p(w | s, epsilon)?

Also, shouldn't it be 1 - epsilon + epsilon / q_option.size(1), according to https://github.com/ShangtongZhang/DeepRL/blob/717fe68e7ed00a80c6c52ec9613c9a16dbb37e0c/deep_rl/agent/OptionCritic_agent.py#L34?

spacegoing commented 4 years ago

Also, why is the sampled option used here rather than the max: https://github.com/ShangtongZhang/DeepRL/blob/717fe68e7ed00a80c6c52ec9613c9a16dbb37e0c/deep_rl/agent/OptionCritic_agent.py#L97

ShangtongZhang commented 4 years ago

> Hi Shangtong,
>
> https://github.com/ShangtongZhang/DeepRL/blob/717fe68e7ed00a80c6c52ec9613c9a16dbb37e0c/deep_rl/agent/OptionCritic_agent.py#L101
>
> I would love to know how the value function is calculated here. Why isn't it simply V = max_w Q(s, w) rather than an expectation over p(w | s, epsilon)?
>
> Also, shouldn't it be 1 - epsilon + epsilon / q_option.size(1), according to https://github.com/ShangtongZhang/DeepRL/blob/717fe68e7ed00a80c6c52ec9613c9a16dbb37e0c/deep_rl/agent/OptionCritic_agent.py#L34?

In L101 I interpret the q value as Q_Omega in Eq. 1 of the Option-Critic paper, corresponding to the epsilon-greedy policy over options (strictly speaking this is not exact, since this q value is trained via intra-option Q-learning instead of SARSA). So v is computed as the weighted sum of Q_Omega under that epsilon-greedy policy. L34 is the weight for the q value of the greedy option; I think it's equivalent to L101.
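
To make the equivalence concrete, here is a minimal sketch (not the repository code; the names and the `[batch, num_options]` shape are assumptions) showing that the `(1 - eps) * max + eps * mean` form of V and the explicit per-option weights, where the greedy option gets 1 - epsilon + epsilon / num_options, give the same value:

```python
import torch

def v_from_epsilon_greedy(q_options: torch.Tensor, epsilon: float) -> torch.Tensor:
    # V(s) as the expectation of Q_Omega(s, w) under the epsilon-greedy policy over
    # options: the greedy option with prob. 1 - eps, a uniformly random one with prob. eps.
    q_max = q_options.max(dim=-1, keepdim=True)[0]   # value of the greedy option
    q_mean = q_options.mean(dim=-1, keepdim=True)    # uniform average over options
    return (1 - epsilon) * q_max + epsilon * q_mean

def v_from_explicit_weights(q_options: torch.Tensor, epsilon: float) -> torch.Tensor:
    # Same expectation written with per-option probabilities: the greedy option gets
    # 1 - eps + eps / num_options (the weight asked about for L34), every other
    # option gets eps / num_options.
    num_options = q_options.size(1)
    probs = torch.full_like(q_options, epsilon / num_options)
    greedy = q_options.argmax(dim=-1, keepdim=True)
    probs.scatter_add_(1, greedy, torch.full_like(probs[:, :1], 1 - epsilon))
    return (probs * q_options).sum(dim=-1, keepdim=True)

# The two forms agree, which is why the weight in L34 matches the V in L101.
q = torch.randn(4, 8)  # hypothetical batch of option values: [batch, num_options]
assert torch.allclose(v_from_epsilon_greedy(q, 0.1), v_from_explicit_weights(q, 0.1))
```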

ShangtongZhang commented 4 years ago

> Also, why is the sampled option used here rather than the max: https://github.com/ShangtongZhang/DeepRL/blob/717fe68e7ed00a80c6c52ec9613c9a16dbb37e0c/deep_rl/agent/OptionCritic_agent.py#L97

This is the advantage of an action at an augmented state (state × option); the option is part of this augmented state (see my DAC paper for details on this augmented MDP).
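
As an illustration, here is a minimal sketch of such an advantage (the names `ret`, `q_options`, `sampled_option` and their shapes are assumptions, not the repository's exact variables): the baseline is the q value indexed by the option that is actually running, because in the augmented MDP that option is part of the state, so the max over options would be the wrong baseline.

```python
import torch

def action_advantage(ret: torch.Tensor, q_options: torch.Tensor,
                     sampled_option: torch.Tensor) -> torch.Tensor:
    # Advantage of a primitive action at the augmented state (s, w):
    # A(s, w, a) ~= G_t - Q_Omega(s, w), where the baseline is the value of the
    # *sampled* option w (the option component of the augmented state), not the
    # greedy option.
    # Assumed shapes: ret [batch, 1], q_options [batch, num_options], sampled_option [batch].
    q_sw = q_options.gather(1, sampled_option.unsqueeze(-1))  # Q of the option being executed
    return ret - q_sw
```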

spacegoing commented 4 years ago

Thank you for your reply. After reading your DAC paper I became a fan of yours :D Brilliant work, mate! I was only recently drawn to reinforcement learning and found it to be such a fascinating area. If possible, could you please share your learning path (textbooks etc.)? I find myself lacking background in many areas, such as augmented MDPs, and I'm a bit confused about where to start. Many thanks!

ShangtongZhang commented 4 years ago

Rich Sutton's book -> Martin Puterman's book on MDPs -> Neuro-Dynamic Programming by Bertsekas

spacegoing commented 4 years ago

Great! Many thanks!