Open YuhangZh opened 1 month ago
Hi there, sorry for the late reply. Yes, you're right, I did make a small mistake here when calculating the state values. However, during the policy update in each iteration we use the action values rather than the state value, so I don't think it changes the result much ;) Anyway, thank you for pointing this out! Good luck!
Hi, I am very interested in your code, and thank you for making Dr. Zhao's course available as code for us to study.
I have a question about the code in e_greedy_MC. When calculating the state value on line 163, is it necessary to also include the action values of the other actions, weighted by their probabilities?
161 policy[s, idx] = 1 - epsilon * (len(env.action_space) - 1) / len(env.action_space)
162 policy[s, np.arange(len(env.action_space)) != idx] = epsilon / len(env.action_space)
163 V[s] = max(Q[s])
If max(Q[s]) is used as the state value for that state, then the greedy action is effectively given probability 1 in the Bellman equation. But under the epsilon-greedy policy defined on lines 161-162, its probability is not 1. Shouldn't the other action values be added to the calculation, weighted according to that probability formula?
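To make the question concrete, here is a small self-contained sketch of what I mean (the state/action sizes, epsilon, and the random Q values are placeholders I made up; only the probability formula mirrors lines 161-162): the state value under the epsilon-greedy policy would be the expectation of Q[s] over that policy, not its maximum.

# Minimal sketch, not the repository's code: n_states, n_actions, and the
# random Q values are hypothetical; only the epsilon-greedy probabilities
# follow lines 161-162 of e_greedy_MC.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 4                   # placeholder sizes
epsilon = 0.1
Q = rng.normal(size=(n_states, n_actions))   # placeholder action values

policy = np.empty((n_states, n_actions))
V_max = np.empty(n_states)                   # what line 163 currently computes
V_expected = np.empty(n_states)              # expectation under the epsilon-greedy policy

for s in range(n_states):
    idx = np.argmax(Q[s])
    # epsilon-greedy probabilities, as on lines 161-162
    policy[s, :] = epsilon / n_actions
    policy[s, idx] = 1 - epsilon * (n_actions - 1) / n_actions
    V_max[s] = np.max(Q[s])                  # line 163: treats the greedy action as if its probability were 1
    V_expected[s] = policy[s] @ Q[s]         # v_pi(s) = sum_a pi(a|s) * q_pi(s, a)

print(V_max - V_expected)                    # small (order epsilon) but nonzero difference

The difference between the two is on the order of epsilon, which is why I was wondering whether it should be accounted for.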
I look forward to your reply.