the code is very well documented; almost every line has a comment, making the code very understandable.
a very clever idea is sorting the state before creating the key for a dictionary/list. Otherwise, the same state reached with two different trajectories is not treated the same.
the minmax function is very sound and well done with two caches to save time.
areas of improvement:
the final results are high but not optimal against a completely random opponent. In my opinion, you should try using different rewards, like 10 for winning and -1 for losing.
constants should be in capital letters.
another nice idea may be playing and training against a more clever opponent. For example, an opponent that checks if it has a winning move or if the other player has a winning move and acts accordingly.
hi,
what I appreciate about your code:
areas of improvement: