The agent trained with mean loss performed better after 300,000 games than the agent trained with sum loss. The mean loss agent knew how to block and win in the center, while the agent trained with sum_loss only knew stackattack.
It looked like the agent trained on sum_loss was still improving, while the agent trained on mean_loss no longer was, but that could simply be because the agent trained on mean_loss didn't have strong enough opponents.
We conclude that the mean_loss method gives better convergence and will continue with this loss method.
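For reference, here is a minimal sketch of what the two accumulation methods mean, assuming a PyTorch-style setup where per-sample losses are collected into a tensor before the backward pass. The function name and shapes are illustrative, not our actual training code:

```python
import torch

def accumulate_loss(per_sample_losses: torch.Tensor, method: str = "mean") -> torch.Tensor:
    """Combine per-sample losses into one scalar before backward().

    "sum" makes the gradient magnitude grow with the number of samples,
    while "mean" keeps it independent of the batch size, so the two
    methods interact differently with the learning rate.
    """
    if method == "sum":
        return per_sample_losses.sum()
    return per_sample_losses.mean()

# Illustrative usage in a training step (the loss values are stand-ins):
per_sample_losses = torch.rand(64, requires_grad=True)
loss = accumulate_loss(per_sample_losses, method="mean")
loss.backward()
```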
These are the combinations of architectures and training strategies that we want to train next (a rough config sketch follows the list). We start out with:

- Average Joe, small architecture (to see if bad performance is due to our extreme reward system), accumulate sum, lr=10e-3
- Average Joe, small architecture (to see if bad performance is due to our extreme reward system), accumulate mean, lr=10e-3
Decide which accumulation method to use going forward.
Then:

- Basic reward, mini architecture
- Basic reward, small architecture
- Reward(?), mini architecture, play against minimax agent
- Exotic reward, mini or small, play against minimax

Old:

- Reward(?), small architecture, play against minimax agent
- Basic reward, large architecture, play against minimax agent
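Purely as an illustration, the planned runs above could be tracked as a small config list. The field names and values here are made up for the sketch, not taken from our code:

```python
# Hypothetical bookkeeping for the planned runs; names are illustrative only.
planned_runs = [
    {"reward": "average_joe", "architecture": "small", "accumulate": "sum",  "lr": 10e-3},
    {"reward": "average_joe", "architecture": "small", "accumulate": "mean", "lr": 10e-3},
    # after deciding on the accumulation method:
    {"reward": "basic",  "architecture": "mini"},
    {"reward": "basic",  "architecture": "small"},
    {"reward": "tbd",    "architecture": "mini",          "opponent": "minimax"},
    {"reward": "exotic", "architecture": "mini_or_small", "opponent": "minimax"},
]
```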
Important note
We have introduced other hyperparameters that we want to explore, but our remaining time for the project is limited. Suggestion: try out some of our new hyperparameters (for instance #28, #30 and #35) against the model in #29, but maybe just for 100k games. Perhaps this could be done on my PC?