Hello Matteo,
Your code is well-structured and easy to follow, though I would appreciate a few more comments. Your README.md is very informative, and the effort you put into it really impressed me. I also appreciate that you visualized the results with a graph.

The only thing I found odd is the update rule for the Q-value, which penalizes future rewards (`-1.0 * np.max(Q_S_t_next)`). In Q-learning the goal is to maximize expected future rewards, not penalize them, so the discounted maximum future reward should be added rather than subtracted.
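For comparison, here is a minimal sketch of the standard tabular Q-learning update, Q(S_t, A_t) += alpha * (R + gamma * max_a Q(S_{t+1}, a) - Q(S_t, A_t)). The variable names and values below (`alpha`, `gamma`, `Q_S_t`, `action`, `reward`) are placeholders I made up for illustration, not names taken from your code:

```python
import numpy as np

alpha, gamma = 0.1, 0.99                      # learning rate and discount factor
Q_S_t = np.zeros(4)                           # Q-values for the current state
Q_S_t_next = np.array([0.2, 0.5, 0.1, 0.0])   # Q-values for the next state
action, reward = 1, 1.0                       # action taken and reward received

# Add the discounted maximum future reward (note the +, not -):
td_target = reward + gamma * np.max(Q_S_t_next)
Q_S_t[action] += alpha * (td_target - Q_S_t[action])
```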
Best of luck!