There was a bug in action selection in the Q-Learning code.
Actions were selected randomly instead of using a policy derived from the current Q-values.
This commit updates action selection to use an epsilon-greedy strategy.
NOTE: the resulting answers remain the same, because the environment is rather simple and Q-Learning worked fine despite the bug.
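
For reviewers unfamiliar with the strategy, here is a minimal sketch of epsilon-greedy action selection. It is not the code from this commit: the function name `epsilon_greedy_action`, the `q_values` list layout (one Q-value per action for the current state), and the random tie-breaking are all assumptions made for illustration.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Return an action index given the Q-values of the current state.

    Hypothetical helper: q_values is assumed to be a non-empty list with
    one entry per action, and epsilon a probability in [0, 1].
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick an action with the highest Q-value,
    # breaking ties at random so no action is systematically favoured.
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)
```

With `epsilon = 1.0` this sketch reduces to purely random selection, which is the behaviour described as the bug; with `epsilon = 0.0` it is purely greedy.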
Counterpart pull request for the master branch: #173