We need a semi-minimax opponent: a depth-limited search that looks a fixed number of moves ahead (for example 3 or 5) and plays the best move it finds. If no best move can be determined within that depth, it just chooses a random legal action.
This gives the policy gradient agent an opponent that it can beat if it plays well, but that is hard to beat, so the agent has to learn more difficult strategies.
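The idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a final design: the `game` interface (`to_move`, `legal_moves`, `apply`, `winner`), the `Nim` toy game, and the names `search` and `choose_move` are all hypothetical. It also assumes that "no best move can be determined" means every legal move gets the same score within the search depth, in which case picking among the top-scoring moves is the same as picking a random legal action.

```python
import random


class Nim:
    """Hypothetical toy game used only for illustration.

    State is (stones, player_to_move). A move takes 1 to 3 stones,
    and whoever takes the last stone wins.
    """

    def to_move(self, state):
        return state[1]

    def legal_moves(self, state):
        return [k for k in (1, 2, 3) if k <= state[0]]

    def apply(self, state, move):
        stones, player = state
        return (stones - move, 1 - player)

    def winner(self, state):
        stones, player = state
        # With no stones left, the player who just moved has won.
        return 1 - player if stones == 0 else None


def search(state, depth, player, game):
    """Depth-limited minimax from `player`'s point of view.

    Returns +1 for a forced win within `depth` plies, -1 for a forced
    loss, and 0 when the outcome is unknown at this depth.
    """
    winner = game.winner(state)
    if winner is not None:
        return 1 if winner == player else -1
    moves = game.legal_moves(state)
    if depth == 0 or not moves:
        return 0
    values = [search(game.apply(state, m), depth - 1, player, game) for m in moves]
    return max(values) if game.to_move(state) == player else min(values)


def choose_move(state, game, depth=3, rng=random):
    """Pick a top-scoring move; ties are broken at random.

    If the search cannot tell the moves apart, every legal move is
    top-scoring, so this reduces to a random legal action.
    """
    player = game.to_move(state)
    moves = game.legal_moves(state)
    scores = {
        m: search(game.apply(state, m), depth - 1, player, game) for m in moves
    }
    best = max(scores.values())
    return rng.choice([m for m in moves if scores[m] == best])
```

For example, `choose_move((5, 0), Nim(), depth=3)` takes one stone, because that is the only move that leaves the other player in a lost position within three plies. The `depth` parameter is the natural knob for making the opponent easier or harder for the policy gradient agent.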