dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

What's the difference between baseline solution and Actor-Critic #116

Open droiter opened 7 years ago

droiter commented 7 years ago

I think the td_error in actor-critic is the same as the advantage in the baseline solution: both are some form of the return minus the predicted value.

One difference is that the actor-critic value network learns via TD (bootstrapping from its own next-state estimate), while the baseline solution learns directly from the full return.

I think they are the same in essence.
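
To make the comparison concrete, here is a minimal NumPy sketch of the two quantities (my own illustration, not code from this repo; the function and variable names are made up): the advantage in REINFORCE with baseline, computed from the full Monte Carlo return, and the td_error in one-step actor-critic, computed from a bootstrapped one-step target.

```python
import numpy as np

def baseline_advantages(rewards, values, gamma=0.99):
    # REINFORCE with baseline: advantage_t = G_t - V(s_t),
    # where G_t is the full Monte Carlo return of the finished episode.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = np.array(returns[::-1])
    return returns - np.asarray(values)

def actor_critic_td_error(reward, value_s, value_next_s, gamma=0.99, done=False):
    # One-step actor-critic: td_error = r + gamma * V(s') - V(s).
    # The critic bootstraps from its own estimate of the next state,
    # so no complete episode is needed before updating.
    target = reward + (0.0 if done else gamma * value_next_s)
    return target - value_s
```

Both quantities multiply the log-probability gradient in the policy update, but the baseline advantage is only available once the episode's return is known, whereas the TD error also uses V(s') to assess the action, which is the bootstrapping difference discussed below.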

unnir commented 7 years ago

would love to know too...

alexlimh commented 7 years ago

Sorry, sent to the wrong person.

cinqs commented 6 years ago

Please take a look at this article and see if it helps: https://www.quora.com/What-is-the-difference-between-policy-gradient-methods-and-actor-critic-methods

epignatelli commented 3 years ago

I just bumped into this issue. It's quite old, but this might be useful for other people who land here too.

To differentiate between policy gradient methods and actor-critic methods, Sutton refers to whether or not the algorithm uses the value estimate of the next state to assess the action, i.e. whether it bootstraps. From Sutton & Barto 2018 (2nd edition), pp. 331-332:

In REINFORCE with baseline, the learned state-value function estimates the value of only the first state of each state transition. This estimate sets a baseline for the subsequent return, but is made prior to the transition's action and thus cannot be used to assess that action. In actor–critic methods, on the other hand, the state-value function is applied also to the second state of the transition. The estimated value of the second state, when discounted and added to the reward, constitutes the one-step return, G_{t:t+1}, which is a useful estimate of the actual return and thus is a way of assessing the action. As we have seen in the TD learning of value functions throughout this book, the one-step return is often superior to the actual return in terms of its variance and computational congeniality, even though it introduces bias. We also know how we can flexibly modulate the extent of the bias with n-step returns and eligibility traces (Chapters 7 and 12). When the state-value function is used to assess actions in this way it is called a critic, and the overall policy-gradient method is termed an actor–critic method. Note that the bias in the gradient estimate is not due to bootstrapping as such; the actor would be biased even if the critic was learned by a Monte Carlo method.
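
To tie the quote back to the original question, here is a small sketch (my own illustration, assuming a finished episode stored as lists rewards and values; not from the book or this repo) of the n-step return it mentions: n = 1 gives the actor-critic target R_{t+1} + gamma * V(S_{t+1}), and letting n run to the end of the episode recovers the full Monte Carlo return used by REINFORCE with baseline.

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    # rewards[k] is the reward received after the action taken at step k,
    # values[k] is the critic's estimate V(s_k); both cover one finished episode.
    T = len(rewards)
    horizon = min(t + n, T)
    # Sum the next n discounted rewards starting at step t ...
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    # ... and bootstrap with the critic's estimate only if the episode continues.
    if horizon < T:
        G += gamma ** (horizon - t) * values[horizon]
    return G
```

Subtracting values[t] from this return gives the baseline advantage (large n) or the td_error (n = 1), so the two quantities in the original question sit at the two ends of the same bias/variance dial the quote describes.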