Sutton's book defines the e-greedy policy as such (pages 27-28, 2nd edition):
> A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability epsilon, instead select randomly from among all the actions with equal probability, independently of the action-value estimates.
The implementation of Q-Learning in this repository had these probabilities inverted (it explored with probability 1 - epsilon and acted greedily with probability epsilon), so I have fixed that.
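For reference, a minimal sketch of the epsilon-greedy selection as Sutton defines it (the function name and signature are illustrative, not taken from this repository):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Pick an action epsilon-greedily: with probability epsilon choose
    uniformly at random among all actions, otherwise choose the action
    with the highest estimated value."""
    if rng.random() < epsilon:
        # Explore: uniform random over ALL actions, ignoring the estimates.
        return int(rng.integers(len(q_values)))
    # Exploit: greedy with respect to the current action-value estimates.
    return int(np.argmax(q_values))
```

Note the direction of the comparison: the random branch fires with probability epsilon, not 1 - epsilon, which is exactly the point of this fix.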