Implement a new agent using the TD($\lambda$) algorithm, inspired by Tesauro's backgammon-playing program, TD-Gammon.
This agent extends the existing player classes but differs from them in several key respects.
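The core of the algorithm is the eligibility-trace weight update. Below is a minimal sketch under stated assumptions: a linear value function with a sigmoid output over the 84-node input vector described in the next section. The class name, method names, and hyperparameter values are placeholders for illustration, not the final design.

```python
import numpy as np

class TDLambdaUpdater:
    """Sketch of the TD(lambda) update with accumulating eligibility traces."""

    def __init__(self, n_inputs=84, alpha=0.1, gamma=1.0, lam=0.7):
        self.w = np.zeros(n_inputs)   # weights of a linear value function
        self.e = np.zeros(n_inputs)   # eligibility trace vector
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor (1.0 for episodic games)
        self.lam = lam                # trace-decay parameter lambda

    def value(self, x):
        # Sigmoid squashes the value into (0, 1), read as P(win).
        return 1.0 / (1.0 + np.exp(-self.w @ x))

    def step(self, x, x_next, reward, terminal):
        v = self.value(x)
        v_next = 0.0 if terminal else self.value(x_next)
        delta = reward + self.gamma * v_next - v   # TD error
        grad = v * (1.0 - v) * x                   # d/dw of sigmoid(w . x)
        self.e = self.gamma * self.lam * self.e + grad
        self.w += self.alpha * delta * self.e

    def reset(self):
        self.e[:] = 0.0   # clear traces at the start of each game
```

TD-Gammon itself backs the TD error through a multilayer network; the linear version above only illustrates the update rule.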
State representation
We represent game states using binary input nodes that indicate which spots are occupied and by whom. The input vector is laid out as follows:
Each of the first 42 nodes: 1 if the corresponding spot is occupied by the opponent, 0 otherwise.
Each of the next 42 nodes: 1 if the corresponding spot is occupied by the player itself, 0 otherwise.
This yields a total of 84 input nodes and is simpler than the representation Tesauro used for TD-Gammon; we may need to enrich it to achieve satisfactory results. A minimal encoding sketch is given below.
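The sketch assumes the board is stored as a numpy array of 42 entries, with 0 for empty spots and the values 1 and 2 for the two players; the function name and board convention are assumptions for illustration.

```python
import numpy as np

def encode_state(board, player):
    """Encode a 42-spot board into the 84-node binary input vector.

    `board` is assumed to be a numpy array with 42 entries holding
    0 (empty), 1, or 2; `player` is the id (1 or 2) from whose
    perspective we encode. Layout follows the text: the opponent's
    occupancy comes first, then the player's own.
    """
    opponent = 2 if player == 1 else 1
    opp_nodes = (board == opponent).astype(np.float64).ravel()  # first 42 nodes
    own_nodes = (board == player).astype(np.float64).ravel()    # next 42 nodes
    return np.concatenate([opp_nodes, own_nodes])               # 84 nodes total
```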
Self-play methodologies
We discussed several ways of implementing self-play:
Simple self-play
The agent plays against itself but learns only from the moves made by one of the two sides.
"Complete" self-play
The agent plays against itself and learns from moves made by both sides.
Frozen self-play
We freeze the opponent's network and play against this frozen version until we can beat it with some certainty. We then continue training against the new, stronger agent.
We start out with simple self-play (a loop sketch follows below) and can discuss later whether to try the other variants.
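To make the simple self-play variant concrete, here is a rough loop sketch that combines the two sketches above. The environment interface (reset, to_move, play, winner) and the agent's select_move method are hypothetical placeholders, not an existing API; updates are applied only to the learning side's transitions.

```python
def self_play_episode(env, agent, learning_player=1):
    """One game of simple self-play: the agent moves for both sides,
    but TD updates use only the learning player's transitions.
    `env` and its methods are hypothetical placeholders.
    """
    board = env.reset()
    agent.reset()                                # clear eligibility traces
    prev_x = None
    while True:
        player = env.to_move()
        move = agent.select_move(board, player)  # e.g. greedy over afterstates
        board, done = env.play(move)
        if player == learning_player:
            x = encode_state(board, learning_player)
            if prev_x is not None:
                agent.step(prev_x, x, reward=0.0, terminal=False)
            prev_x = x
        if done:
            # Terminal reward from the learning player's perspective
            # (a draw could be scored as 0.5 instead of 0.0).
            r = 1.0 if env.winner() == learning_player else 0.0
            agent.step(prev_x, None, reward=r, terminal=True)
            return
```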
Needed for training