dennybritz / reinforcement-learning

Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
http://www.wildml.com/2016/10/learning-reinforcement-learning/
MIT License
20.61k stars 6.04k forks

A mistake in policy gradient for cliff walking REINFORCE #138

Open stonyhu opened 6 years ago

stonyhu commented 6 years ago

Hi author, your code is just wonderful and it helped me a lot in building a deep reinforcement learning system for my project. But I found a mistake in the following code. When you print out steps:

print("\rStep {} @ Episode {}/{} ({})".format( t, i_episode + 1, num_episodes, stats.episode_rewards[i_episode - 1]), end="")

Why is the index `i_episode - 1` in `stats.episode_rewards[i_episode - 1]`, when `i_episode` begins at 0? I think the index should be `i_episode`.

I would appreciate it if you have time to check!

for t in itertools.count():

            # Take a step
            action_probs = estimator_policy.predict(state)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(action)

            # Keep track of the transition
            episode.append(Transition(
              state=state, action=action, reward=reward, next_state=next_state, done=done))

            # Update statistics
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t

            # Print out which step we're on, useful for debugging.
            print("\rStep {} @ Episode {}/{} ({})".format(
                    t, i_episode + 1, num_episodes, stats.episode_rewards[i_episode - 1]), end="")
            # sys.stdout.flush()

            if done:
                break

            state = next_state
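For reference, here is a minimal, self-contained sketch of the suspected off-by-one. The reward values and episode count below are made up for illustration; in the actual code, `stats.episode_rewards` is an array indexed by episode:

```python
# Hypothetical setup: 3 episodes with made-up per-episode reward totals.
num_episodes = 3
episode_rewards = [0.0] * num_episodes
per_episode_reward = [-10.0, -5.0, -1.0]  # illustrative values only

printed = []
for i_episode in range(num_episodes):
    episode_rewards[i_episode] += per_episode_reward[i_episode]
    # Suspected buggy index: for i_episode == 0 this reads
    # episode_rewards[-1] (the last slot, still 0.0), and otherwise
    # it reads the *previous* episode's total.
    printed.append(episode_rewards[i_episode - 1])

print(printed)          # [0.0, -10.0, -5.0] -- lags one episode behind
print(episode_rewards)  # [-10.0, -5.0, -1.0] -- indexing with i_episode matches
```

So the printed reward is always one episode stale (and wraps around to the last slot on episode 0), which is consistent with using `i_episode` instead of `i_episode - 1` as the fix.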
dennybritz commented 6 years ago

Hm, yes, that does seem strange. Can you submit a pull request to fix it?