dszpr opened 1 month ago
"You can use the data in PRM/data to train Mistral-7B as the initial process reward model and obtain VALUE_MODEL_STATE_DICT. We also provide PRM/train_VM_chatglm.py and PRM/train_VM_mistral.py."

Thanks! I noticed that you updated the code last week. May I ask what the two JSON files llama_local_critic_dpo.json and mistral_local_critic_dpo.json mentioned in https://github.com/THUDM/ReST-MCTS/blob/main/self_train/self_train_dpo.py are, and where I can find them?
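For reference, my current guess is that each entry in those files is a standard DPO preference record, i.e. a prompt with a chosen and a rejected completion; the field names and contents below are my own assumption, not something I found in the repo:

```python
import json

# My guess at one record in llama_local_critic_dpo.json / mistral_local_critic_dpo.json:
# a prompt plus a preferred ("chosen") and a dispreferred ("rejected") solution.
example_record = {
    "prompt": "Question: What is 2 + 2?\nLet's think step by step.",
    "chosen": "Step 1: 2 + 2 = 4.\nThe answer is 4.",
    "rejected": "Step 1: 2 + 2 = 5.\nThe answer is 5.",
}

with open("mistral_local_critic_dpo.json", "w", encoding="utf-8") as f:
    json.dump([example_record], f, ensure_ascii=False, indent=2)
```

Is that roughly the expected format, and are these files supposed to be generated by the MCTS* search or downloaded from somewhere?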
Same issue. How do we make the DPO dataset?
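In case it helps, what I am trying as a workaround is to build the preference pairs myself from the MCTS* outputs: for each question, take the highest-value and lowest-value complete solutions as chosen/rejected. This pairing rule, the rollout format, and all field names below are my own assumptions, not the repo's:

```python
import json

def build_dpo_pairs(rollouts):
    """rollouts maps each question to a list of (solution_text, value_score)
    pairs collected from MCTS* search; returns DPO-style preference records."""
    records = []
    for question, solutions in rollouts.items():
        if len(solutions) < 2:
            continue  # need at least two solutions to form a preference pair
        ranked = sorted(solutions, key=lambda s: s[1])  # ascending by value score
        records.append({
            "prompt": question,
            "chosen": ranked[-1][0],   # highest-scoring solution
            "rejected": ranked[0][0],  # lowest-scoring solution
        })
    return records

if __name__ == "__main__":
    rollouts = {
        "What is 2 + 2?": [
            ("2 + 2 = 4. The answer is 4.", 0.92),
            ("2 + 2 = 5. The answer is 5.", 0.11),
        ],
    }
    with open("llama_local_critic_dpo.json", "w", encoding="utf-8") as f:
        json.dump(build_dpo_pairs(rollouts), f, ensure_ascii=False, indent=2)
```

It would be great to know whether the intended dataset is built this way or with a different selection rule.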
Appreciate the great work! I tried to run the MCTS* search following the README, and I wonder what VALUE_MODEL_STATE_DICT is. Besides, I noticed that you uploaded a model on HF, 'zd21/ReST-MCTS-Llama3-8b-Instruct-Policy-1st'; is it an inference model or a value model? Looking forward to your reply!
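For what it's worth, my working assumption is that VALUE_MODEL_STATE_DICT is just the path to the torch checkpoint saved by PRM/train_VM_mistral.py, loaded onto the value model roughly as below; the checkpoint path, backbone id, and the use of strict=False are placeholders and guesses on my part, not the repo's actual setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder path to the checkpoint produced by PRM training (my assumption).
VALUE_MODEL_STATE_DICT = "checkpoints/mistral_prm/value_model.pt"

# Placeholder backbone; the repo may wrap it with an extra value head.
backbone = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

state_dict = torch.load(VALUE_MODEL_STATE_DICT, map_location="cpu")
# strict=False because the saved dict presumably also contains value-head weights.
backbone.load_state_dict(state_dict, strict=False)
backbone.eval()
```

If the actual loading path is different (e.g. the checkpoint is a full value-model class rather than a bare state dict), a pointer to the right snippet would be appreciated.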