Open YuhangZh opened 1 month ago
Hi there, sorry for the late reply. Yes, you're right, I did make a small mistake here when calculating the state values. However, during the policy update in each iteration we use the action values rather than the state value, so I don't think it changes the result much ;) Anyway, thank you for pointing this out! Good luck!
Hi, I am very interested in your code, and thank you for making Dr. Zhao's course available as code for us to study.
I have a question about the code in e_greedy_MC. When calculating the state value on line 163, is it necessary to also include the action values of the other actions, weighted by their probabilities?
161 policy[s, idx] = 1 - epsilon * (len(env.action_space) - 1) / len(env.action_space)
162 policy[s, np.arange(len(env.action_space)) != idx] = epsilon / len(env.action_space)
163 V[s] = max(Q[s])
If max(Q[s]) is used as the state value for that state, then the greedy action is effectively given probability 1 in the Bellman equation. But under the epsilon-greedy policy defined on lines 161-162, its probability is not 1. Shouldn't the other action values be added to the calculation, weighted according to that probability formula?
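To make the question concrete, here is a small self-contained sketch of what I mean (the state/action sizes, epsilon, and the random Q values are placeholders I made up; only the probability formula mirrors lines 161-162): the state value under the epsilon-greedy policy would be the expectation of Q[s] over that policy, not its maximum.

# Minimal sketch, not the repository's code: n_states, n_actions, and the
# random Q values are hypothetical; only the epsilon-greedy probabilities
# follow lines 161-162 of e_greedy_MC.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 4                   # placeholder sizes
epsilon = 0.1
Q = rng.normal(size=(n_states, n_actions))   # placeholder action values

policy = np.empty((n_states, n_actions))
V_max = np.empty(n_states)                   # what line 163 currently computes
V_expected = np.empty(n_states)              # expectation under the epsilon-greedy policy

for s in range(n_states):
    idx = np.argmax(Q[s])
    # epsilon-greedy probabilities, as on lines 161-162
    policy[s, :] = epsilon / n_actions
    policy[s, idx] = 1 - epsilon * (n_actions - 1) / n_actions
    V_max[s] = np.max(Q[s])                  # line 163: treats the greedy action as if its probability were 1
    V_expected[s] = policy[s] @ Q[s]         # v_pi(s) = sum_a pi(a|s) * q_pi(s, a)

print(V_max - V_expected)                    # small (order epsilon) but nonzero difference

The difference between the two is on the order of epsilon, which is why I was wondering whether it should be accounted for.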
I look forward to your reply.