Hi Gabriele,
First of all, well done: your code was very clear to me.
I really like that you trained the Q-agent not only against a random player but also against a minimax agent; this makes the Q-table more accurate, since the more important states are visited and learned, which helps optimize the moves.
However, I have some advice that I think can make your code even better, mainly regarding the balance of your exploration-exploitation trade-off:
Regarding the epsilon parameter, I suggest a more comprehensive tuning, especially for the training phase: try several kinds of decay schedules, from linear to exponential, so you can find the one that works best. A rough sketch of what the comparison could look like is below.
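Just to illustrate the idea (the names `NUM_EPISODES`, `EPS_START`, and `EPS_END` are placeholders I made up, not taken from your code), two common schedules you could plug into your training loop and compare look roughly like this:

```python
NUM_EPISODES = 50_000
EPS_START, EPS_END = 1.0, 0.01

def linear_epsilon(episode: int) -> float:
    """Linear decay from EPS_START down to EPS_END over all episodes."""
    frac = min(episode / NUM_EPISODES, 1.0)
    return EPS_START + (EPS_END - EPS_START) * frac

def exponential_epsilon(episode: int, decay: float = 0.9999) -> float:
    """Exponential decay: epsilon shrinks by a fixed factor each episode,
    clipped at EPS_END so some exploration always remains."""
    return max(EPS_END, EPS_START * decay ** episode)
```

The exponential variant explores heavily early on and commits to the learned policy sooner, while the linear one spreads exploration more evenly; which works better depends on how many episodes you train for, so it is worth measuring both.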
As for the move-selection strategy, in addition to epsilon-greedy I recommend trying alternatives such as upper confidence bound (UCB) and softmax (or Boltzmann) exploration, so you can see whether one of them performs better; a sketch of both is below.
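Here is a minimal sketch of both strategies, assuming you can get a NumPy array of Q-values for the legal moves in the current state (the function and parameter names are my own, hypothetical ones):

```python
import numpy as np

def ucb_action(q_values: np.ndarray, counts: np.ndarray, t: int, c: float = 1.4) -> int:
    """UCB1: favour actions whose value estimate is still uncertain.
    counts[a] is how many times action a has been tried so far, t is the
    total number of decisions made."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])  # try every action at least once
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))

def boltzmann_action(q_values: np.ndarray, temperature: float = 1.0) -> int:
    """Softmax (Boltzmann) exploration: sample an action with probability
    proportional to exp(Q / temperature). High temperature ~ random play,
    low temperature ~ greedy play."""
    prefs = (q_values - q_values.max()) / temperature  # shift for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))
```

A nice property of the Boltzmann version is that the temperature can be annealed over training exactly like epsilon, giving you another knob to tune alongside the decay schedules above.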
I hope my suggestions are helpful to you, and best of luck.