I looked around on the internet and couldn't find anything that definitively answers the question; it would take more time to dive into properly. I think the plots should look somewhat like what we've seen in other DL cases, but we should keep in mind that policy gradient approaches often need a lot of training time before they converge.
Below I've linked some of what I found on the internet. Then I've added some comments on what we observe when looking at plots for current trained agents.
I found a few articles, two of which I want to highlight. The first is a thesis on RL (mostly policy gradient) from 2016, which could be interesting to read if we had the time. Several sources mention that policy gradient needs to train for a long time, so maybe what we have observed so far shouldn't surprise us. The thesis also seems to cover plenty of possible approaches to combat that challenge. Obviously we don't have time to dive into it now, but I wanted to share it anyway; it might be interesting to pick up sometime down the road.
The second is an interesting study that proposes a new objective which directly leverages minimax.
I also found a tutorial for policy gradient that I plan to look at later (TM).
Almost all models from the .csv as of yesterday have been trained. However, I don't see any model that is obviously worth training for another 24-36 hours.
A general pattern we observe with our current minimax training strategy (both with pretrained and new models): the optimizer finds a minimum already after around 10k-15k games, and all the other plots (winrate, avgprobs) converge around the same time. Additionally, avgprobs (both win and loss) are generally very high when training against minimax. I think this behaviour comes from one of two things:
1. The training strategy is bad. With 1/3 random and 2/3 minimax opponents, the agent just learns to beat the random opponent and most often loses to the minimax. Having one very easy opponent and one difficult opponent confuses the algorithm enough that it can't learn an advanced strategy.
2. Other hyperparameters. Maybe the optimizer finds a local minimum too quickly and can't escape it again - it certainly looks like that on the plots. Should we experiment with a learning rate scheduler? (A minimal sketch of what that could look like is right below this list.)
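If we do try a scheduler, something with warm restarts might fit the "stuck in a local minimum" picture better than a plain decay. A minimal sketch, assuming we're on PyTorch; the network dimensions and learning-rate values are placeholders, not our actual config:

```python
import torch

# Placeholder policy network and learning rate - substitute our actual model/config.
policy_net = torch.nn.Linear(128, 7)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

# Cosine annealing with warm restarts: the LR decays over T_0 episodes and then
# jumps back up, which could help kick the policy out of an early plateau.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=5_000)

for episode in range(20_000):
    # ... play one episode, compute the policy-gradient loss,
    # then loss.backward(); optimizer.step(); optimizer.zero_grad() ...
    scheduler.step()  # advance the schedule once per episode
```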
My guess is that the observed behaviour is primarily due to point 1. If we have time for it, I think we should try out training on only minimaxers and see if the agent learns better in this way. There might be a point in replacing the randoms with earlier generations of the agent instead. This might give different results for pretrained and newly initialised networks.
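To make the "replace the randoms with earlier generations" idea concrete, here is a rough sketch of how opponent sampling could look. All names here (the snapshot pool, the minimax opponent, the 2/3 split) are placeholders for illustration, not our actual code:

```python
import random

def pick_opponent(snapshot_pool, minimax_opponent, p_minimax=2 / 3):
    """Choose the opponent for the next training game.

    `snapshot_pool` holds frozen copies of earlier generations of the agent;
    `minimax_opponent` is the existing minimax player.
    """
    if not snapshot_pool or random.random() < p_minimax:
        return minimax_opponent
    # Otherwise play a frozen earlier version of ourselves instead of a random player.
    return random.choice(snapshot_pool)

# During training we would periodically freeze the current agent and append it
# to snapshot_pool (e.g. every 5k episodes), so the opponent set keeps improving.
```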
The winrate for the pretrained model takes a little more than 15k episodes to drop (and converge) to 25%, while the winrate for non-pretrained models takes around 7-8k episodes to rise (and converge) to around 30%. My guess: in the beginning, the pretrained model wins all the time against random but only rarely against minimax. The penalty from losing to minimax destroys its strategy against random without really teaching it to win against minimax. The non-pretrained model, on the other hand, just learns StackAttack immediately and doesn't evolve further. Extra question: what's going on with the avgprobs for the pretrained model? Probloss ends up being higher than probwin. We should look at how the agents actually play and find out whether my assumption holds.
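One way to check this (assuming "avgprobs" means the average probability the policy assigned to the moves it actually played, which is my reading of the metric) would be to recompute it split by opponent type as well as by outcome. A rough sketch; the `episodes` structure is hypothetical:

```python
from statistics import mean

def avg_probs_by_outcome(episodes, opponent=None):
    """Average chosen-move probability for won vs. lost games.

    `episodes` is assumed to be a list of dicts with keys
    "probs" (chosen-move probabilities), "won" (bool) and "opponent" (str).
    """
    wins, losses = [], []
    for ep in episodes:
        if opponent is not None and ep["opponent"] != opponent:
            continue
        (wins if ep["won"] else losses).append(mean(ep["probs"]))
    return {
        "probwin": mean(wins) if wins else None,
        "probloss": mean(losses) if losses else None,
    }

# Comparing avg_probs_by_outcome(episodes, "random") with
# avg_probs_by_outcome(episodes, "minimax") would show whether the high
# probloss comes mostly from confident-but-losing games against minimax.
```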
The training plots for TequilaBoi behave much like those for the Defender reward systems. AverageJoe gets the highest winrate against itself, but I think it mostly just replays the same games.
It seems like Small generally does better than Mini.

To sum up what I've found:

- Small generally does better than the Mini architecture.
- I've made a few edits to how Neptune views the game.
- I think we should look at how agents play and add tags for: StackAttack, CanBlock.
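For the tagging part, if we're on the newer Neptune client this should be roughly all it takes (the project name and run setup below are placeholders, not our actual workspace):

```python
import neptune  # assumes neptune client >= 1.0; older versions use `import neptune.new as neptune`

# Placeholder project name - substitute our actual workspace/project.
run = neptune.init_run(project="our-workspace/our-project")

# Attach behaviour tags so we can filter agents by how they play in the Neptune UI.
run["sys/tags"].add(["StackAttack", "CanBlock"])

run.stop()
```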
Honestly, I think it's very difficult to decide which agents to train further. I think we should start by changing the minimax training strategy and see what happens when we train only against minimax, or alternatively replace the randoms with earlier generations. The agents not training against minimax most likely find exploits we don't want them to find, while those using the current minimax training strategy probably just learn to beat random.
How do we even expect the loss plots to look? Should they look like what we've otherwise seen in the DL course? Maybe find some examples in the literature.
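My tentative answer, assuming our loss is a vanilla REINFORCE-style objective (which is my reading of our setup, not something I've verified against the code): the quantity we minimise is roughly

$$
\mathcal{L}(\theta) = -\,\mathbb{E}\big[\, G \,\log \pi_\theta(a \mid s) \,\big],
$$

where $G$ is the episode return. Since $G$ can be negative for lost games and its distribution shifts as the policy and the opponent mix change, this loss has no reason to decrease monotonically the way a supervised cross-entropy loss does; a noisy, roughly flat loss curve alongside an improving winrate is normal. So winrate against a fixed opponent is probably the better convergence signal, and we shouldn't expect the loss plots to look like the ones from the DL course.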