
othello-rl

This repo contains implementations of a few Othello AI agents and a UI to play against them. Currently, the following agents are available: Minimax, Monte Carlo Tree Search (MCTS), and a Deep Q-Learning (DQN) agent, along with a simple random player.

The front-end of the game is implemented in JS, the server in Flask + Python, and the AI agents and part of the game environment logic are implemented in C. To use the C-based agents, the corresponding DLL files (Minimax, MCTS, Game Environment) need to be placed in the repo directory.
Run python game_server.py to start the UI

AI and game environment files

The game environment is available in multiple flavors in the file game_env.py.

Similarly, the AI agents are available in multiple flavors in players.py.

Runtime Comparison for AI

MCTS Comparison of Python and C at 100 simulations per move

| Game Environment Type | Player Type | Speed |
| --- | --- | --- |
| Python | Python with Python Environment | 0.16 it/s |
| C | Python with C Environment | 0.33 it/s |
| C | C | 16 it/s |

Minimax Comparison of Python and C at max depth of 3

| Game Environment Type | Player Type | Speed |
| --- | --- | --- |
| Python | Python with Python Environment | 2 it/s |
| C | Python with C Environment | 5 it/s |
| C | C | 130 it/s |
| C | C (max depth 4) | 43 it/s |

For the UI, the max depth of Minimax is capped at 9 and the number of MCTS simulations is capped at 50000 to keep AI move generation close to real time.

DQN AI

A DQN version of the AI is also implemented in players.py under DeepQLearningAgent. To train this agent, run python training_dqn.py. The agent trains itself against a random player for a set number of episodes using a convolutional neural network and deep Q-learning.
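The repo's training_dqn.py drives this loop end to end. As a rough illustration of the idea only (the network and function names below are hypothetical, and the repo may use a different deep learning framework), the sketch shows a small convolutional Q-network over an 8x8 board encoding and a single deep Q-learning update step, written with PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    # Small CNN mapping a 2-channel board encoding to 64 action values (one per square).
    def __init__(self, board_size=8):
        super().__init__()
        self.conv1 = nn.Conv2d(2, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.head = nn.Linear(64 * board_size * board_size, board_size * board_size)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.head(x.flatten(start_dim=1))

def dqn_update(qnet, target_net, optimizer, batch, gamma=0.99):
    # One deep Q-learning step on a batch of (state, action, reward, next_state, done) tensors.
    states, actions, rewards, next_states, dones = batch
    q_values = qnet(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones)
    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # Smoke test with random tensors standing in for transitions collected against a random player.
    net, target = QNet(), QNet()
    target.load_state_dict(net.state_dict())
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    batch = (
        torch.randn(32, 2, 8, 8),            # states
        torch.randint(0, 64, (32,)),         # actions
        torch.randn(32),                     # rewards
        torch.randn(32, 2, 8, 8),            # next states
        torch.randint(0, 2, (32,)).float(),  # done flags
    )
    print("loss:", dqn_update(net, target, opt, batch))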

Game Recording

The Game class also provides the facility to record gameplay between two players using matplotlib and FFmpeg (git-2019-12-01-637742b).

from game_env import Game
from players import MCTSPlayerC, MiniMaxPlayerC, RandomPlayer
board_size = 8

# initialize classes
p1 = MCTSPlayerC(board_size=board_size, n_sim=50000)
p2 = MiniMaxPlayerC(board_size=board_size, depth=9)
g = Game(player1=p1, player2=p2, board_size=board_size)
# play and record
g.play()
g.record_gameplay(path='images/gameplay_mcts_minimax.mp4')

Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a stochastic method that uses simulations to determine the next best move to play. It is divided into four steps. Whenever we are asked to choose a move, we initiate a fresh instance of MCTS.
The basic idea of MCTS is to build a tree of game states by randomly choosing moves for either side. In this process, we may not be able to explore all the nodes (we are limited by the total number of simulations), but for the nodes just below the root node (the current state for which we have to choose a move) we expect to have a good estimate of the approximate win probabilities.

  1. Selection

     Select a node in the tree that is neither a leaf node nor fully explored.

     • A leaf node is a terminal node of the game.
     • A node is not fully explored until all of its children have been visited at least once.

     During the initial simulations we will always pick the root node here, and keep picking it until all of its children have been visited at least once. Once a node is fully explored, we select one of its children using the UCB1 statistic (see the sketch after this list): the child with the highest UCB1 value is chosen.

  2. Expansion

     Select one of this node's unexplored children; if all the children had already been explored, we would still be in the selection step. This child node represents one of the available legal moves. We play this move on the current state and get the updated board state variables.

  3. Simulation

     From the new node added to the game tree in the expansion step, play till the end of the game by randomly choosing moves. If the node is a terminal node, we skip this step. This constitutes one full run of the game starting from the new node.

  4. Backpropagation

     One run of the game has ended, and we update a few attributes of all the nodes that participated in this run, starting from the node added in the expansion phase and going all the way back up to the root node.

     • Add 1 to n, the attribute counting how many times this node has taken part in a simulation.
     • Update the attribute N for all the children of this node; N denotes the total number of simulations the parent has participated in.
     • Update the attribute w, the number of wins this node has participated in. w is incremented by 1 when this node's color is opposite to the winner's. This inversion happens because the UCB1 value used in the selection phase should represent the win rate from the parent node's perspective, and the parent's color is the opposite of the current node's color.
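To make the bookkeeping above concrete, here is a small illustrative sketch of the node attributes, the UCB1 selection rule, and the backpropagation update. The class and function names are hypothetical and are not taken from this repo's code; treat it as a sketch of the technique rather than the repo's implementation.

import math

class Node:
    # Illustrative MCTS node bookkeeping (hypothetical, not the repo's class).
    def __init__(self, color, parent=None):
        self.color = color      # color of the player who moved to reach this state
        self.parent = parent
        self.children = []
        self.n = 0              # simulations this node has taken part in
        self.N = 0              # total simulations the parent has taken part in
        self.w = 0              # wins counted from the parent's perspective

def ucb1(node, c=math.sqrt(2)):
    # UCB1 = win rate + exploration bonus; unvisited children are tried first.
    if node.n == 0:
        return float('inf')
    return node.w / node.n + c * math.sqrt(math.log(node.N) / node.n)

def select_child(node):
    # Selection step: pick the child with the highest UCB1 value.
    return max(node.children, key=ucb1)

def backpropagate(node, winner_color):
    # Backpropagation step: update n, N and w from the expanded node up to the root.
    while node is not None:
        node.n += 1
        for child in node.children:
            child.N = node.n
        if node.color != winner_color:
            node.w += 1     # win counted with the color inversion described above
        node = node.parent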