THUDM / WebRL

Building Open LLM Web Agents with Self-Evolving Online Curriculum RL

`SFT baseline` #1

Closed · matbee-eth closed this 4 days ago

matbee-eth commented 1 week ago

Are you able to supply the SFT baseline model? Are you starting from a base model or an instruct/chat model?

I may be slightly confused: is it necessary for us to train the SFT baseline model ourselves, or can we continue training from your published weights?

QZH-777 commented 1 week ago

Thank you for raising this issue! Here are detailed answers to your questions:

Q1: We recommend training the SFT baseline model yourself, as we have released the dataset and code. The model we use is the base model, not the instruct/chat model.

Q2: We use the SFT-trained model as the initial model for WebRL; details about the WebRL process can be found in our paper. Since we introduce a new loss function (the KL-constrained policy update algorithm detailed in the paper), we are unsure whether our published model parameters are a suitable starting point for continued training.
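For Q1, a minimal sketch of what training the SFT baseline from a base checkpoint might look like with Hugging Face Transformers. This is illustrative only: the checkpoint name, dataset path, field names, and hyperparameters are placeholders, not the repo's actual training setup.

```python
# Illustrative SFT sketch; the actual WebRL scripts and settings may differ.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumption: start from a *base* model, not an instruct/chat variant,
# as recommended above. The checkpoint name is a placeholder.
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # base models often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder path and field names; substitute the released SFT dataset.
dataset = load_dataset("json", data_files="sft_data.json")["train"]

def tokenize(example):
    # Concatenate prompt and target into one causal-LM training sequence.
    text = example["prompt"] + example["response"]
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft_baseline",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels for causal LM.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```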
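For Q2, as background on the term: a KL-constrained policy update generally maximizes expected reward while penalizing divergence from a reference policy (here, the SFT model). The generic form of this objective, which is not necessarily WebRL's exact loss (see the paper for that), is

$$
\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[r(s, a)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big),
$$

whose optimum satisfies $\pi^{*}(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\,\exp\big(r(s, a)/\beta\big)$. This is why the published weights are entangled with the reference policy used during RL, and why they may not behave like a plain SFT checkpoint if you continue training from them.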