[x] next legal moves should be used to calculate discounted reward in train_agent in DeepQLearningAgent
[x] next legal moves and next player should be added to the history object of Game class
[x] next legal moves should be passed to the replay buffer
[x] done should be 1 for each player's last move of the game, for all players
[x] convert the board to float32 before passing it to the network (otherwise tf sometimes raises dtype errors)
[x] change dtype to np.uint64 whenever adding bitboard to numpy arrays for consistency
[x] change buffer dtypes accordingly
[x] make add_to_buffer in class Game generic to not hard code indices
[x] when there are no next legal moves, train_agent in class DeepQLearningAgent produces nan: in `np.max(np.where(next_legal_moves==1, discounted_reward, -np.inf), axis=1) * (1-done)`, the max over an all-masked row is `-np.inf`, and `-np.inf * 0` evaluates to `np.nan`, not 0. Guard the result with `np.isfinite()`
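A minimal sketch of the nan issue and the `np.isfinite()` guard described above. The array shapes and values here are hypothetical, chosen only to reproduce the all-zero legal-move row that triggers `-inf * 0 = nan`:

```python
import numpy as np

# Hypothetical batch: 3 transitions, 4 possible moves each.
# next_legal_moves is a 0/1 mask; an all-zero row means the game ended
# and the next state has no legal moves.
discounted_reward = np.array([[1.0, 2.0, 3.0, 4.0],
                              [0.5, 0.5, 0.5, 0.5],
                              [9.0, 9.0, 9.0, 9.0]], dtype=np.float32)
next_legal_moves = np.array([[1, 0, 1, 0],
                             [0, 0, 0, 0],   # terminal: no legal next moves
                             [1, 1, 1, 1]], dtype=np.uint8)
done = np.array([0, 1, 0], dtype=np.float32)

# Naive version: the max over the all-masked row is -inf, and
# -inf * (1 - done) = -inf * 0 = nan, which poisons the training targets.
masked_max = np.max(np.where(next_legal_moves == 1,
                             discounted_reward, -np.inf), axis=1)

# Fix: replace non-finite maxima with 0 before multiplying by (1 - done).
target = np.where(np.isfinite(masked_max), masked_max, 0.0) * (1 - done)
```

With the guard in place, the terminal row contributes a target of 0 instead of nan, and the other rows keep their masked maxima.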