Closed — matbee-eth closed this issue 4 days ago
Thank you for raising this issue! Here are detailed answers to your questions:

Q1: We recommend training the SFT baseline model yourself, since we have released both the dataset and the code. The model we use is the base model, not the instruct/chat model.

Q2: We use the SFT-trained model as the initial model for WebRL; details of the WebRL training process can be found in our paper. Because we introduce a new loss function (the KL-constrained policy update algorithm detailed in the paper), we are unsure whether our published model parameters are a good starting point for continued pretraining.
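For context, here is a minimal, generic sketch of what a KL-constrained policy update can look like; this is not the exact loss from the paper, and the function name, the `beta` coefficient, and the clamp value are illustrative assumptions. The idea is that maximizing E[A(s, a)] − β·KL(π‖π_ref) has the closed-form solution π*(a|s) ∝ π_ref(a|s)·exp(A(s, a)/β), which yields an advantage-weighted log-likelihood objective when projecting the learned policy onto π*.

```python
import torch

def kl_constrained_policy_loss(logprobs_new, logprobs_ref, advantages, beta=0.1):
    """Illustrative KL-constrained policy update (not the paper's exact loss).

    Maximizing E[A(s, a)] - beta * KL(pi || pi_ref) has the closed-form
    solution pi*(a|s) proportional to pi_ref(a|s) * exp(A(s, a) / beta).
    With samples drawn from the reference (SFT) policy, projecting the
    learned policy onto pi* gives an advantage-weighted log-likelihood loss.
    """
    # Per-sample weights from the closed-form optimal policy; clamp for stability.
    weights = torch.exp(advantages / beta).clamp(max=20.0)
    # Weighted negative log-likelihood of the sampled actions under the new policy.
    loss = -(weights.detach() * logprobs_new).mean()
    # Approximate KL to the reference policy, useful for monitoring drift.
    approx_kl = (logprobs_new - logprobs_ref).mean()
    return loss, approx_kl

# Example with dummy per-action log-probabilities and advantages.
logprobs_new = torch.randn(8, requires_grad=True)
logprobs_ref = torch.randn(8)
advantages = torch.randn(8)
loss, approx_kl = kl_constrained_policy_loss(logprobs_new, logprobs_ref, advantages)
loss.backward()
```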
Are you able to supply the SFT baseline model? Did you start from a base model or an instruct/chat model?
I may be slightly confused: is it necessary for us to train the SFT baseline model ourselves, or can we continue pretraining from your published weights?