I looked around on the internet and couldn't find anything that definitively answers the question; it would take more time to dive into properly. I think the plots should look somewhat like what we've seen in other DL cases, but we should keep in mind that policy gradient approaches often need a lot of training time before they converge.
Below I've linked some of what I found on the internet. Then I've added some comments on what we observe when looking at plots for current trained agents.
I found a few articles, two of which I want to highlight. The first is a thesis on RL (mostly policy gradient) from 2016, which could be interesting to read if we had the time. Several sources mention that policy gradient needs to train for a long time, so maybe what we have observed so far shouldn't surprise us. The thesis also seems to cover plenty of possible approaches to combat that challenge. Obviously we don't have time to dive into it now, but I wanted to share it anyway; it might be interesting to pick up sometime down the road.
The second is an interesting study that proposes a new objective which directly leverages minimax.
I also found a tutorial for policy gradient that I plan to look at later (TM).
Almost all models from the .csv as of yesterday have been trained. However, I don't see any model that is obviously worth training for another 24-36 hours.
A general pattern we observe with our current minimax training strategy (both with pretrained and new models): the optimizer finds a minimum already after around 10k-15k games, and all the other plots (winrate, avgprobs) converge around the same time. Additionally, avgprobs (both win and loss) are generally very high when training against minimax. I think this behaviour comes from one of two things:
1. The training strategy is bad. With 1/3 random and 2/3 minimax opponents, the agent just learns to beat the random opponent and most often loses to the minimax. Having one very easy opponent and one difficult opponent confuses the algorithm enough that it can't learn an advanced strategy.
2. Other hyperparameters. Maybe the optimizer finds a local minimum too quickly and can't escape it again - it certainly looks like that on the plots. Should we experiment with a learning rate scheduler? (A minimal sketch of what that could look like is right below this list.)
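If we do try a scheduler, something with warm restarts might fit the "stuck in a local minimum" picture better than a plain decay. A minimal sketch, assuming we're on PyTorch; the network dimensions and learning-rate values are placeholders, not our actual config:

```python
import torch

# Placeholder policy network and learning rate - substitute our actual model/config.
policy_net = torch.nn.Linear(128, 7)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

# Cosine annealing with warm restarts: the LR decays over T_0 episodes and then
# jumps back up, which could help kick the policy out of an early plateau.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=5_000)

for episode in range(20_000):
    # ... play one episode, compute the policy-gradient loss,
    # then loss.backward(); optimizer.step(); optimizer.zero_grad() ...
    scheduler.step()  # advance the schedule once per episode
```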
My guess is that the observed behaviour is primarily due to point 1. If we have time for it, I think we should try out training on only minimaxers and see if the agent learns better in this way. There might be a point in replacing the randoms with earlier generations of the agent instead. This might give different results for pretrained and newly initialised networks.
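To make the "replace the randoms with earlier generations" idea concrete, here is a rough sketch of how opponent sampling could look. All names here (the snapshot pool, the minimax opponent, the 2/3 split) are placeholders for illustration, not our actual code:

```python
import random

def pick_opponent(snapshot_pool, minimax_opponent, p_minimax=2 / 3):
    """Choose the opponent for the next training game.

    `snapshot_pool` holds frozen copies of earlier generations of the agent;
    `minimax_opponent` is the existing minimax player.
    """
    if not snapshot_pool or random.random() < p_minimax:
        return minimax_opponent
    # Otherwise play a frozen earlier version of ourselves instead of a random player.
    return random.choice(snapshot_pool)

# During training we would periodically freeze the current agent and append it
# to snapshot_pool (e.g. every 5k episodes), so the opponent set keeps improving.
```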
The winrate for the pretrained model takes a little more than 15k episodes to drop (and converge) to 25%, while the winrate for non-pretrained models takes around 7-8k episodes to rise (and converge) to around 30%. My guess: in the beginning, the pretrained model wins all the time against random but only rarely against minimax. The penalty from losing to minimax destroys its strategy against random without really teaching it to win against minimax. The non-pretrained model, on the other hand, just learns StackAttack immediately and doesn't evolve further. Extra question: what's going on with the avgprobs for the pretrained model? Probloss ends up being higher than probwin. We should look at how the agents actually play and find out whether my assumption holds.
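One way to check this (assuming "avgprobs" means the average probability the policy assigned to the moves it actually played, which is my reading of the metric) would be to recompute it split by opponent type as well as by outcome. A rough sketch; the `episodes` structure is hypothetical:

```python
from statistics import mean

def avg_probs_by_outcome(episodes, opponent=None):
    """Average chosen-move probability for won vs. lost games.

    `episodes` is assumed to be a list of dicts with keys
    "probs" (chosen-move probabilities), "won" (bool) and "opponent" (str).
    """
    wins, losses = [], []
    for ep in episodes:
        if opponent is not None and ep["opponent"] != opponent:
            continue
        (wins if ep["won"] else losses).append(mean(ep["probs"]))
    return {
        "probwin": mean(wins) if wins else None,
        "probloss": mean(losses) if losses else None,
    }

# Comparing avg_probs_by_outcome(episodes, "random") with
# avg_probs_by_outcome(episodes, "minimax") would show whether the high
# probloss comes mostly from confident-but-losing games against minimax.
```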
The training plots for TequilaBoi behave much like those for the Defender reward systems. AverageJoe gets the highest winrate against itself, but I think it mostly just replays the same games.
It seems like Small generally does better than Mini.

To sum up what I've found:

- Small generally does better than the Mini architecture.
- I've made a few edits to how Neptune views the game.
- I think we should look at how agents play and add tags for: StackAttack, CanBlock.
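For the tagging part, if we're on the newer Neptune client this should be roughly all it takes (the project name and run setup below are placeholders, not our actual workspace):

```python
import neptune  # assumes neptune client >= 1.0; older versions use `import neptune.new as neptune`

# Placeholder project name - substitute our actual workspace/project.
run = neptune.init_run(project="our-workspace/our-project")

# Attach behaviour tags so we can filter agents by how they play in the Neptune UI.
run["sys/tags"].add(["StackAttack", "CanBlock"])

run.stop()
```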
Honestly, I think it's very difficult to decide which agents to train further. I think we should start by changing the minimax training strategy and see what happens when we train only against minimax, or alternatively replace the randoms with earlier generations. The agents not training against minimax most likely find exploits we don't want them to find, while those using the current minimax training strategy probably just learn to beat random.
How do we even expect the loss plots to look? Should they look like what we've otherwise seen in the DL course? Maybe find some examples in the literature.
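My tentative answer, assuming our loss is a vanilla REINFORCE-style objective (which is my reading of our setup, not something I've verified against the code): the quantity we minimise is roughly

$$
\mathcal{L}(\theta) = -\,\mathbb{E}\big[\, G \,\log \pi_\theta(a \mid s) \,\big],
$$

where $G$ is the episode return. Since $G$ can be negative for lost games and its distribution shifts as the policy and the opponent mix change, this loss has no reason to decrease monotonically the way a supervised cross-entropy loss does; a noisy, roughly flat loss curve alongside an improving winrate is normal. So winrate against a fixed opponent is probably the better convergence signal, and we shouldn't expect the loss plots to look like the ones from the DL course.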