THUDM / WebRL

Building Open LLM Web Agents with Self-Evolving Online Curriculum RL

`SFT baseline` #1

Closed · matbee-eth closed this 4 days ago

matbee-eth commented 1 week ago

Are you able to supply the SFT baseline model? Are you starting from a base model or an instruct/chat model?

I may be slightly confused: is it necessary for us to train the SFT baseline model ourselves, or can we continue training from your published weights?

QZH-777 commented 1 week ago

Thank you for raising this issue! Here are detailed answers to your questions:

Q1: We recommend training the SFT baseline model yourself, as we have released the dataset and code. The model we use is the base model, not the instruct/chat model.

Q2: We use the SFT-trained model as the initial model for WebRL; details about the WebRL process can be found in our paper. Since we introduce a new loss function (the KL-constrained policy update algorithm detailed in the paper), we are unsure whether our published model parameters are a suitable starting point for continued training.
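For Q1, a minimal sketch of what training the SFT baseline from a base checkpoint might look like with Hugging Face Transformers. This is illustrative only: the checkpoint name, dataset path, field names, and hyperparameters are placeholders, not the repo's actual training setup.

```python
# Illustrative SFT sketch; the actual WebRL scripts and settings may differ.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumption: start from a *base* model, not an instruct/chat variant,
# as recommended above. The checkpoint name is a placeholder.
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # base models often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder path and field names; substitute the released SFT dataset.
dataset = load_dataset("json", data_files="sft_data.json")["train"]

def tokenize(example):
    # Concatenate prompt and target into one causal-LM training sequence.
    text = example["prompt"] + example["response"]
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft_baseline",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels for causal LM.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```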
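For Q2, as background on the term: a KL-constrained policy update generally maximizes expected reward while penalizing divergence from a reference policy (here, the SFT model). The generic form of this objective, which is not necessarily WebRL's exact loss (see the paper for that), is

$$
\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[r(s, a)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big),
$$

whose optimum satisfies $\pi^{*}(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\,\exp\big(r(s, a)/\beta\big)$. This is why the published weights are entangled with the reference policy used during RL, and why they may not behave like a plain SFT checkpoint if you continue training from them.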