Closed — tongyx361 closed this issue 2 months ago
Thank you for your insightful question. At each iteration round, we initialize the model from the pre-trained model rather than from the model of the previous round. In our experiments, we found that the quality of the autonomously generated solutions improves with each successive round. However, initializing from the previous round's model may lead to convergence to local minima due to the influence of suboptimal solutions. Reintroducing the pre-trained model at the beginning of each round therefore prevents potential biases or errors from compounding and helps avoid entrapment in local minima. Continuing supervised fine-tuning (SFT) without this reinitialization might be unable to escape these local optima and could thereby limit the model's overall improvement.
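To make the loop concrete, here is a minimal sketch of the strategy described above. This is not the repo's actual code; `load_pretrained`, `generate_solutions`, and `sft` are illustrative stand-ins, and the data flow (generate with the current model, then fine-tune a fresh copy of the pre-trained model) is my reading of the answer.

```python
def load_pretrained():
    """Stand-in for loading the original pre-trained checkpoint."""
    return {"weights": "pretrained", "sft_rounds": 0}

def generate_solutions(model, problems):
    """Stand-in for autonomous solution generation with the current model."""
    return [f"solution_to_{p}" for p in problems]

def sft(model, data):
    """Stand-in for supervised fine-tuning on the generated solutions."""
    return {"weights": f"sft_on_{len(data)}_samples",
            "sft_rounds": model["sft_rounds"] + 1}

problems = ["p1", "p2", "p3"]
model = load_pretrained()
for round_idx in range(3):
    # Generate solutions with the current (possibly fine-tuned) model,
    # whose quality improves round over round.
    data = generate_solutions(model, problems)
    # Key point: fine-tune a fresh copy of the pre-trained model, NOT the
    # previous round's model, so biases/errors do not compound across rounds.
    model = sft(load_pretrained(), data)
```

Note that because each round starts from `load_pretrained()`, the final model has been fine-tuned exactly once (`sft_rounds == 1`), only on the latest round's data, rather than accumulating SFT steps on top of earlier checkpoints.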
Thanks for your clarification!
I could not find the exact implementation.