ShangtongZhang / reinforcement-learning-an-introduction

Python Implementation of Reinforcement Learning: An Introduction

Chapter 3: GridWorld #78

Closed: ychong closed this issue 6 years ago

ychong commented 6 years ago

Hi Shangtong,

First of all, thank you so much for writing this Python code for the RL textbook.

For chapter03's GridWorld.py, I believe there might be an error in the update code for the 'Optimal Policy', i.e. the Bellman optimality equation: the update does not take the action probabilities, pi, into account.

I understand that in the book, the Bellman optimality equation (3.19) does not include the action probabilities, pi. But if you look more closely, the maximization over policies is done through the last term, V*(s'), which is defined as the maximum over pi of V_pi(s').
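For reference, my reading of equation (3.19), the Bellman optimality equation for the state-value function, is roughly:

```latex
v_*(s) = \max_a \mathbb{E}\bigl[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\bigr]
       = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]
```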

May I suggest that the code at lines 140-142 be fixed as follows (the last line is the suggested change):

```python
for action in actions:
    newPosition = nextState[i][j][action]
    values.append(actionReward[i][j][action] + discount * world[newPosition[0], newPosition[1]])

# suggested change: weight each action value by its probability before taking the max
newWorld[i][j] = np.max([a * b for a, b in zip(actionProb[i][j].values(), values)])
```

The action probabilities, pi, matter because one could run into a case where the future reward is large but the probability of the corresponding action is very small, making the expected value very small. If we just maximize the rewards (without taking the action probabilities into account), we might pick the wrong optimal policy.
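A tiny numeric sketch of the concern (hypothetical numbers, not taken from the repo):

```python
import numpy as np

values = np.array([10.0, 1.0])        # action 0 promises a large future return
action_prob = np.array([0.01, 0.99])  # but is chosen with very small probability

print(np.max(values))                 # 10.0 -> plain max ignores the probabilities
print(np.max(action_prob * values))   # 0.99 -> weighting by pi changes which action wins
```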

Please let me know what you think.

I can be reached at chongyixiang@gmail.com

Sincerely, Chong Yi Xiang

ShangtongZhang commented 6 years ago

You are wrong. `world` is the current estimate of the optimal value function, not the value function of a particular policy.
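A minimal sketch of the distinction, on a made-up two-state MDP rather than the repo's grid world: the backup for v_pi is an expectation weighted by the policy's action probabilities, while the backup for v_* takes a max over actions and involves no policy at all.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action MDP (deterministic transitions, made-up rewards).
reward = np.array([[0.0, 1.0],
                   [1.0, 0.0]])       # reward[s][a]
next_state = np.array([[1, 0],
                       [0, 1]])       # next_state[s][a]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])           # equiprobable random policy

# Policy evaluation backup: v_pi(s) = sum_a pi(a|s) * (r + gamma * v_pi(s'))
v = np.zeros(2)
for _ in range(1000):
    v = np.array([sum(pi[s][a] * (reward[s][a] + gamma * v[next_state[s][a]])
                      for a in range(2)) for s in range(2)])

# Value iteration backup: v_*(s) = max_a (r + gamma * v_*(s')), no pi anywhere
v_star = np.zeros(2)
for _ in range(1000):
    v_star = np.array([max(reward[s][a] + gamma * v_star[next_state[s][a]]
                           for a in range(2)) for s in range(2)])

print(v)       # value of the random policy
print(v_star)  # optimal value function, strictly larger here
```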

ychong commented 6 years ago

Yes, Shangtong, I was wrong on this. Apologies, and thank you for clarifying. The optimal value function takes the maximum over actions of the immediate reward plus the discounted value of the next state, rather than averaging action values under a policy.