droiter opened this issue 7 years ago
would love to know too...
Sorry, sent to the wrong person.
Please see this article; it may help: https://www.quora.com/What-is-the-difference-between-policy-gradient-methods-and-actor-critic-methods
I just bumped into this post. It is quite old, but this might be useful for other people who stumble upon it as well.
To differentiate between Policy Gradient methods and Actor-Critic methods, Sutton looks at whether the algorithm uses a value estimate of the next state (i.e., bootstrapping) to assess the action. From Sutton & Barto 2018 (2nd edition), pp. 331-332:
In REINFORCE with baseline, the learned state-value function estimates the value of only the first state of each state transition. This estimate sets a baseline for the subsequent return, but is made prior to the transition’s action and thus cannot be used to assess that action. In actor–critic methods, on the other hand, the state-value function is applied also to the second state of the transition. The estimated value of the second state, when discounted and added to the reward, constitutes the one-step return, G_{t:t+1}, which is a useful estimate of the actual return and thus is a way of assessing the action. As we have seen in the TD learning of value functions throughout this book, the one-step return is often superior to the actual return in terms of its variance and computational congeniality, even though it introduces bias. We also know how we can flexibly modulate the extent of the bias with n-step returns and eligibility traces (Chapters 7 and 12). When the state-value function is used to assess actions in this way it is called a critic, and the overall policy-gradient method is termed an actor–critic method. Note that the bias in the gradient estimate is not due to bootstrapping as such; the actor would be biased even if the critic was learned by a Monte Carlo method.
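To make the quoted distinction concrete, here is a minimal sketch in Python (the function and variable names are my own illustrative assumptions, not code from this repo) of the signal each method uses to score an action:

```python
def reinforce_baseline_advantages(rewards, values, gamma=0.99):
    """REINFORCE with baseline: score each action with the full Monte Carlo
    return G_t minus the baseline V(s_t); V(s_{t+1}) is never used here."""
    G, advantages = 0.0, []
    for r, v in zip(reversed(rewards), reversed(values)):
        G = r + gamma * G            # full return from time t onward
        advantages.append(G - v)     # advantage = G_t - V(s_t)
    return list(reversed(advantages))

def actor_critic_advantages(rewards, values, next_values, gamma=0.99):
    """One-step actor-critic: score each action with the one-step return
    G_{t:t+1} = r_{t+1} + gamma * V(s_{t+1}) minus V(s_t), i.e. the TD error.
    (Pass V(s_{t+1}) = 0 for a terminal transition.)"""
    return [r + gamma * v_next - v   # bootstrapped advantage / TD error
            for r, v, v_next in zip(rewards, values, next_values)]
```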
I think the td_error in AC plays the same role as the advantage in the baseline solution: both subtract the predicted value from a reward-based target. One difference is that the AC value network learns from a TD target, while the baseline solution regresses directly on the return. I think they are the same in essence.
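A minimal sketch of that difference in how the value function itself is trained (again with illustrative names, not the repo's code): the baseline regresses directly on the Monte Carlo return, while the actor-critic value network regresses on a bootstrapped TD target, and the same td_error that drives the value update also scores the action.

```python
def mc_value_update(V, s, G, lr=0.01):
    """Baseline / Monte Carlo: move V(s) directly toward the observed return G_t.
    V is any mutable mapping from state to value, e.g. a dict or numpy array."""
    V[s] += lr * (G - V[s])

def td_value_update(V, s, r, s_next, done, gamma=0.99, lr=0.01):
    """Actor-critic / TD(0): move V(s) toward the target r + gamma * V(s_next)."""
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]   # this same quantity scores the action in AC
    V[s] += lr * td_error
    return td_error
```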