Closed josephbak closed 4 years ago
So, whenever we sample action probabilities from that policy, it uses the latest Q values. You could even create a new policy inside the episode loop and nothing would change. This works because the policy is created inside the mc_control_epsilon_greedy method and closes over the same instance of Q, so it logically always uses the updated Q values.
Think of it as defining the policy function inside mc_control_epsilon_greedy, so that it always reads the current Q values.
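To make the "implicit improvement" concrete, here is a minimal sketch of how such a closure behaves. The structure of make_epsilon_greedy_policy below is an assumption based on this discussion (the real function in the repo may differ in detail); the key point is that the returned policy_fn holds a reference to the Q dict, not a copy, so mutating Q in place changes what the policy returns:

```python
from collections import defaultdict

import numpy as np


def make_epsilon_greedy_policy(Q, epsilon, nA):
    """Return a policy function that is epsilon-greedy w.r.t. Q.

    The returned closure captures a *reference* to the Q dict, so any
    in-place update to Q is immediately reflected in the policy.
    (Sketch of the function discussed in this thread, not the exact code.)
    """
    def policy_fn(observation):
        # Spread epsilon uniformly, then put the remaining mass on argmax.
        probs = np.ones(nA, dtype=float) * epsilon / nA
        best_action = np.argmax(Q[observation])
        probs[best_action] += 1.0 - epsilon
        return probs

    return policy_fn


Q = defaultdict(lambda: np.zeros(2))
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=2)

# Initially Q["s"] is all zeros; argmax breaks the tie toward action 0.
print(policy("s"))  # greedy mass on action 0: [0.95 0.05]

# Mutate Q in place -- no need to rebuild the policy.
Q["s"][1] = 1.0
print(policy("s"))  # greedy mass has moved to action 1: [0.05 0.95]
```

This is why building the policy once before the episode loop is enough: every call to the policy re-reads Q at that moment.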
Thanks for the comment, @makaveli10. I have confirmed that the policy is indeed being improved implicitly, but I don't think I fully understand the concept yet. The policy is initialized before the loop for i_episode in range(1, num_episodes + 1):. I know that the policy uses Q when it is created and that Q is updated inside the loop. It is just still not clear to me how the policy keeps being updated by changes to Q after its initialization.
Hi, in the mc_control_epsilon_greedy function, the comment line before the return statement says # The policy is improved implicitly by changing the Q dictionary. For this to happen, shouldn't the
policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)
line be inside the episode loop, so that a new policy is built from the updated Q?