dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License
20.45k stars, 6.02k forks

Is a line missing in 'MC Control with Epsilon-Greedy Policies Solution.ipynb'? #220

Open Ritz111 opened 4 years ago

Ritz111 commented 4 years ago

In the function mc_control_epsilon_greedy:

        # Find all (state, action) pairs we've visited in this episode
        # We convert each state to a tuple so that we can use it as a dict key
        sa_in_episode = set([(tuple(x[0]), x[1]) for x in episode])
        for state, action in sa_in_episode:
            sa_pair = (state, action)
            # Find the first occurrence of the (state, action) pair in the episode
            first_occurence_idx = next(i for i,x in enumerate(episode)
                                       if x[0] == state and x[1] == action)
            # Sum up all discounted rewards from the first occurrence onward
            G = sum([x[2]*(discount_factor**i) for i,x in enumerate(episode[first_occurence_idx:])])
            # Calculate average return for this state over all sampled episodes
            returns_sum[sa_pair] += G
            returns_count[sa_pair] += 1.0
            Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]

        # The policy is improved implicitly by changing the Q dictionary

    return Q, policy

I think a line should be added after the Q update, just before the return statement:

            Q[state][action] = returns_sum[sa_pair] / returns_count[sa_pair]

        # The policy is improved implicitly by changing the Q dictionary
        policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)

    return Q, policy

Otherwise the policy will not be updated.

makaveli10 commented 4 years ago

@Ritz111 No, it does update. The policy function looks up the current Q values every time it is called to pick the next action, so updating Q in place also updates the policy. There is no need to re-create it after each episode.
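
To illustrate the point above, here is a minimal sketch of why the policy follows Q without being rebuilt. It assumes `make_epsilon_greedy_policy` returns a function that reads `Q` at call time (a closure over the same dictionary). The version below is a simplified pure-Python stand-in written for this illustration, not the notebook's exact code, and the state `(0,)` and the Q values are made up.

```python
from collections import defaultdict


def make_epsilon_greedy_policy(Q, epsilon, nA):
    """Hypothetical stand-in: returns a policy that reads Q each time it is called."""
    def policy_fn(observation):
        # Spread epsilon evenly over all actions
        probs = [epsilon / nA] * nA
        # Give the remaining probability mass to the currently greedy action
        values = Q[observation]
        best_action = max(range(nA), key=lambda a: values[a])
        probs[best_action] += 1.0 - epsilon
        return probs
    return policy_fn


nA = 2
Q = defaultdict(lambda: [0.0] * nA)
policy = make_epsilon_greedy_policy(Q, epsilon=0.1, nA=nA)

state = (0,)             # made-up state used only for this demo
before = policy(state)   # all Q values are zero, so the tie goes to action 0
Q[state][1] = 1.0        # in-place update, like Q[state][action] = ... in the loop
after = policy(state)    # the same policy object now prefers action 1
```

Because `policy_fn` keeps a reference to `Q` rather than a copy, the assignment `Q[state][action] = ...` inside the episode loop is enough. Calling `make_epsilon_greedy_policy` again would produce an equivalent function over the same dictionary.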