THUDM / ReST-MCTS

ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024)

What is VALUE_MODEL_STATE_DICT? #5

Open dszpr opened 1 month ago

dszpr commented 1 month ago

Thanks for the great work! I tried to run MCTS* search following the README, and I wonder what VALUE_MODEL_STATE_DICT is. Besides, I noticed that you uploaded a model on HF, 'zd21/ReST-MCTS-Llama3-8b-Instruct-Policy-1st'; is it an inference model or a value model? Looking forward to your reply!

zhangdan0602 commented 1 month ago
  1. You can download the $D_{V_0}$ dataset and put it in PRM/data to train Mistral-7B as the initial process reward model and obtain VALUE_MODEL_STATE_DICT. We also provide PRM/train_VM_chatglm.py and PRM/train_VM_mistral.py (see the sketch after this list for how the resulting checkpoint is used).
  2. 'zd21/ReST-MCTS-Llama3-8b-Instruct-Policy-1st' is an inference model.
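
To make step 1 concrete, here is a minimal sketch of how the checkpoint produced by PRM/train_VM_mistral.py could be plugged back in as VALUE_MODEL_STATE_DICT. This is not the repo's exact loading code: the checkpoint path and the use of `load_state_dict(strict=False)` are assumptions.

```python
# Minimal sketch (hypothetical paths; not the repo's exact loading code):
# after training with PRM/train_VM_mistral.py, VALUE_MODEL_STATE_DICT points
# at the saved checkpoint, which is loaded back onto the Mistral backbone.
import torch
from transformers import AutoModel

BASE_MODEL = "mistralai/Mistral-7B-v0.1"              # base model fine-tuned as the PRM
VALUE_MODEL_STATE_DICT = "PRM/ckpts/mistral_prm.pt"   # hypothetical checkpoint path

backbone = AutoModel.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

state = torch.load(VALUE_MODEL_STATE_DICT, map_location="cpu")
# strict=False tolerates extra keys (e.g. a value head) saved alongside the backbone.
missing, unexpected = backbone.load_state_dict(state, strict=False)
print(f"loaded value model: {len(missing)} missing, {len(unexpected)} unexpected keys")
```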
dszpr commented 3 weeks ago

Thanks! I noticed that you updated the code last week. May I ask what these two JSON files are and where to find them: llama_local_critic_dpo.json and mistral_local_critic_dpo.json, mentioned in https://github.com/THUDM/ReST-MCTS/blob/main/self_train/self_train_dpo.py?

thunder95 commented 3 weeks ago

> Thanks! I noticed that you updated the code last week. May I ask what these two JSON files are and where to find them: llama_local_critic_dpo.json and mistral_local_critic_dpo.json, mentioned in https://github.com/THUDM/ReST-MCTS/blob/main/self_train/self_train_dpo.py?

Same issue. How do we construct the DPO dataset?
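
For reference while waiting on the maintainers: below is a minimal sketch of the conventional DPO pair schema (prompt / chosen / rejected). The actual fields and construction of llama_local_critic_dpo.json and mistral_local_critic_dpo.json are not confirmed in this thread; the field names and the reward-based pairing are assumptions based on common DPO pipelines, where chosen/rejected would come from high- vs. low-reward solution paths.

```python
# Hypothetical sketch of a DPO pair file; field names and the reward-based
# pairing are assumptions, not the repo's documented format.
import json

dpo_pairs = [
    {
        "prompt": "Question: ...\nSolve the problem step by step.",
        "chosen": "Step 1: ...  (solution path scored high by the process reward model)",
        "rejected": "Step 1: ...  (solution path scored low by the process reward model)",
    },
]

with open("llama_local_critic_dpo.json", "w", encoding="utf-8") as f:
    json.dump(dpo_pairs, f, indent=2, ensure_ascii=False)
```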