Hello,
I want to clarify the word “iteration”: you have to distinguish between what I call a “training iteration” (the blue bar progresses a bit, the target is fixed) and a “learning iteration” (one round of self-play, training and arena, i.e. 3 full progress bars).
On MuZero it may be different, but in AlphaZero there are hundreds or thousands of self-play games, and then the model is trained on the results of these games (plus some older ones too). During this training there are several training iterations during which the target is fixed, so it can be considered supervised learning. The learning rate changes over this training. Then, after validation by an arena, the new model plays against itself again. The results of these games are therefore new compared to the previous learning iteration, and training will aim at a new target. The learning rate will follow the same cycle as in the previous training.
In short, I consider each training run as new and independent of the previous one, so starting a OneCycleLR schedule from scratch each time doesn’t hurt.
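To illustrate what I mean, here is a rough sketch in plain PyTorch (names like `net` and `examples` are made up for illustration, this is not the actual code of this repo):

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import OneCycleLR

def train_one_learning_iteration(net, examples, epochs=10):
    """One 'training' in the sense above: the target is fixed, and a
    fresh OneCycleLR cycle covers exactly these training iterations."""
    optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
    scheduler = OneCycleLR(optimizer, max_lr=0.1,
                           total_steps=epochs * len(examples))
    for _ in range(epochs):
        for x, y in examples:                 # (input, target) batches
            loss = nn.functional.mse_loss(net(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()                  # advances within this cycle only

# Each learning iteration (self-play -> training -> arena) calls this again
# with new examples, so the schedule restarts from scratch on the new target.
```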
I don’t have training plots; I’ve just compared the strength of models learned this way versus without OneCycleLR and found a slight but significant improvement.
Thanks for the answer! I understand your point. What I find strange is that while you are using a new dataset at each training iteration, the model is the same, and I had the impression that restarting the schedule could lead to catastrophic forgetting.
The equivalent in classic supervised learning would be training a model on one dataset after another, resetting the schedule every time. It's kind of what fine-tuning a pre-trained model does, but I'm not aware of the impact of doing this over and over so many times.
The dataset is the same at each training iteration (just using different batches, as usual).
But the dataset is different at each learning iteration: the point is to aim at a new target. The new data is better since it is built from games played by smarter models, so we want the training to forget the old data. Well, we keep the latest n iterations, but the older ones are removed.
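Roughly like this (again just an illustrative sketch, not the actual implementation; the window size `N_KEPT_ITERATIONS` is made up):

```python
from collections import deque

N_KEPT_ITERATIONS = 5                 # hypothetical window size
history = deque(maxlen=N_KEPT_ITERATIONS)

def add_learning_iteration(new_examples):
    # Appending beyond maxlen silently drops the oldest iteration,
    # so training always targets data from the most recent models.
    history.append(new_examples)

def training_examples():
    # Flatten the kept iterations into one training set.
    return [ex for iteration in history for ex in iteration]
```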
Ok, thanks again for answering. I see the bottom line is that the model still learns well.
Hi, I have recently stumbled upon this repository and am going through the code to better understand Alpha Zero.
One weird thing I noticed is the creation of the OneCycleLR scheduler each time the model's training function is called. Since this happens at every iteration, the learning rate probably ends up very bumpy. The scheduler was designed with supervised learning in mind, where the training process is more straightforward.
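For example, something like this toy script (a dummy model, not this repo's code) shows the learning-rate pattern I expect:

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

lrs = []
for iteration in range(5):                     # 5 "learning iterations"
    # A brand-new scheduler is built each time, as in the repo.
    scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=100)
    for step in range(100):                    # training iterations
        lrs.append(optimizer.param_groups[0]["lr"])
        optimizer.step()
        scheduler.step()

# `lrs` now contains five identical one-cycle ramps back to back:
# the "bumpy" pattern, rather than one long annealing schedule.
```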
At the same time, the model seems to learn to play very well, so it cannot be that bad.
Do you have any insights on why it works? Or maybe you have a graph of the learning rates throughout training to illustrate what happens?