dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License

A question about MC Control with Epsilon-Greedy Policies Solution.ipynb #224

Closed: josephbak closed this issue 4 years ago

josephbak commented 4 years ago

Hi, in the `mc_control_epsilon_greedy` function, the comment line before the return statement says `# The policy is improved implicitly by changing the Q dictionary`. For this to happen, shouldn't the line `policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)` be inside the episode loop, so that a new policy is built from the updated Q?

makaveli10 commented 4 years ago

Whenever we sample action probabilities from that policy, it uses the latest Q values, so creating a new policy inside the episode loop wouldn't change anything. This works because the policy is created inside `mc_control_epsilon_greedy` from the same Q instance that the loop keeps updating; if you look closely, it has to read the updated Q values.

Think of it as defining the policy function inside `mc_control_epsilon_greedy` so that every call to it uses the current Q values.
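A minimal sketch makes this concrete. The body below is my paraphrase of the notebook's `make_epsilon_greedy_policy` helper, so treat the exact details as assumptions; the point is that the inner function closes over the Q dict itself, not a snapshot of it:

```python
import numpy as np

def make_epsilon_greedy_policy(Q, epsilon, nA):
    # policy_fn closes over the Q dict by reference: every call reads
    # whatever values Q holds at call time, not at creation time.
    def policy_fn(observation):
        A = np.ones(nA, dtype=float) * epsilon / nA
        best_action = np.argmax(Q[observation])
        A[best_action] += 1.0 - epsilon
        return A
    return policy_fn
```

Because the MC control loop mutates Q in place (it never rebinds the name to a new dict), the single policy object created before the loop already reflects every Q update on each subsequent call.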

josephbak commented 4 years ago

Thanks for the comment, @makaveli10. I have confirmed that the policy is indeed improved implicitly, but I don't think I fully understand the concept yet. The policy is initialized before the loop `for i_episode in range(1, num_episodes + 1):`. I know the policy uses Q when it is created, and that Q is updated inside the loop; it just isn't clear to me how the policy keeps changing via Q after its initialization.
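For reference, here is the kind of toy check I ran to confirm the behavior, using the sketch above (the state key and values are made up):

```python
from collections import defaultdict
import numpy as np

Q = defaultdict(lambda: np.zeros(2))
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=2)

print(policy("s"))   # [0.95, 0.05]: action 0 is greedy while Q["s"] is all zeros
Q["s"][1] = 1.0      # mutate Q in place, exactly as the episode loop does
print(policy("s"))   # [0.05, 0.95]: the same policy object now prefers action 1
```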