dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

Policy evaluation formulation #130

Open Jaijith opened 6 years ago

Jaijith commented 6 years ago

In the Policy Evaluation and Policy Iteration solution notebooks, why is the value function calculated with the equation below?

v += action_prob * prob * (reward + discount_factor * V[next_state])

Shouldn't the value function be calculated with this equation instead?

v += action_prob * (reward + prob * discount_factor * V[next_state])

The agent gets the reward as soon as it takes the action, and the transition probability should multiply only the value function of the next state.

Correct me if I am wrong
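
For context, here is a minimal sketch of the loop that update comes from (not the repo's exact code), assuming a Gym-style discrete environment where env.P[s][a] is a list of (prob, next_state, reward, done) transition tuples, as in the repo's GridworldEnv:

```python
import numpy as np

def policy_eval(policy, env, discount_factor=1.0, theta=1e-8):
    """Iterative policy evaluation (a sketch under the assumptions above).

    Assumes env.P[s][a] yields (prob, next_state, reward, done) tuples and
    policy is an [nS, nA] array of action probabilities.
    """
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            v = 0.0
            # Expectation over the actions chosen by the policy ...
            for a, action_prob in enumerate(policy[s]):
                # ... and over the environment's stochastic transitions.
                for prob, next_state, reward, done in env.P[s][a]:
                    # The line the question refers to: the transition
                    # probability weights the reward and the successor value.
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V
```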

chaonan99 commented 6 years ago

See page 59 of the textbook (http://incompleteideas.net/book/bookdraft2018jan1.pdf).
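
For reference, the iterative policy evaluation update given there (Eq. 4.5 in Sutton & Barto) is

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma\, v_k(s') \bigr],$$

which is exactly what the line of code implements: the joint transition probability p(s', r | s, a) weights both the immediate reward and the discounted value of the successor state.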

Jaijith commented 6 years ago

Thanks a lot

Jaijith commented 6 years ago

In David Silver's lectures, policy evaluation is always written with the state-transition probability multiplying only the next-state values, with the reward term pulled out of that sum. I got confused by the notation.
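
For reference, David Silver's slides write the Bellman expectation equation with the expected reward pulled out in front, roughly

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \Bigl( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_\pi(s') \Bigr), \qquad \mathcal{R}_s^a = \mathbb{E}\bigl[ R_{t+1} \mid S_t = s, A_t = a \bigr].$$

Since \mathcal{R}_s^a = \sum_{s', r} p(s', r \mid s, a)\, r, this is the same backup as the four-argument p(s', r | s, a) form in Sutton's book, just with the reward expectation taken up front.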

memoiry commented 6 years ago

It's actually the same.

\sum_{next_state, r} prob(next_state, r | s, a) * (reward + discount_factor * V[next_state])
= \sum_{next_state, r} prob(next_state, r | s, a) * reward + \sum_{next_state, r} prob(next_state, r | s, a) * discount_factor * V[next_state]

We have \sum_{next_state, r} prob(next_state, r | s, a) = 1, and in this gridworld the reward depends only on (s, a) (every outcome of a given action yields the same reward), so the first term collapses to reward.

So the expression above equals reward + \sum_{next_state, r} prob(next_state, r | s, a) * discount_factor * V[next_state], which is the form you expected.
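
A quick numeric check of that equivalence, using a made-up one-step model (the probabilities and values below are illustrative only):

```python
# Hypothetical transition model for a single (s, a) pair: two possible
# successors, each yielding the same immediate reward, as in the gridworld.
transitions = [
    # (prob, next_state, reward)
    (0.7, 0, -1.0),
    (0.3, 1, -1.0),
]
V = {0: -2.0, 1: -5.0}      # arbitrary current value estimates
discount_factor = 0.9

# Form used in the notebook: the probability weights the whole backup.
notebook_form = sum(prob * (reward + discount_factor * V[next_state])
                    for prob, next_state, reward in transitions)

# Form from the question: reward outside, probability only on V[next_state].
reward = transitions[0][2]  # well defined because every outcome has the same reward
question_form = reward + discount_factor * sum(prob * V[next_state]
                                               for prob, next_state, _ in transitions)

print(notebook_form, question_form)  # both evaluate to -3.61 (up to float rounding)
```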