Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
To set policy_stable variable, provided code checks whether the policy is changed. If there are multiple optimal policies, the policy may change infinitely even though optimal policy is already found.
See Exercise 4.4 of the 2018 edition in Sutton & Barto book, it explicitly points out this bug in the pseudocode.
To set
policy_stable
variable, provided code checks whether the policy is changed. If there are multiple optimal policies, the policy may change infinitely even though optimal policy is already found.See Exercise 4.4 of the 2018 edition in Sutton & Barto book, it explicitly points out this bug in the pseudocode.