THUDM / ReST-MCTS

ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024)

Hope for a more detailed README! #6

PKUfreshman opened this issue 1 month ago

PKUfreshman commented 1 month ago

Sorry to interrupt! I really appreciate your work, but I can't do either inference or self-training based on the README.

For inference, I followed the README but failed to run evaluate.py. What are VALUE_BASE_MODEL_DIR and VALUE_MODEL_STATE_DICT? Also, the model you released on HF, zd21/ReST-MCTS-Llama3-8b-Instruct-Policy-1st, seems to have a problem: I've tried many times, but loading the checkpoint shards always fails with `safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer`. Someone also raised this question on HF.

As for training, the README says nothing about it.

I hope you can update the README and maybe double-check your HF model. Thank you very much!

sarvghotra commented 1 month ago

@zhangdan0602 +1. I would appreciate it if you could add some instructions for training the models.

zhangdan0602 commented 1 month ago

We have updated README.md to illustrate the details. Specifically, VALUE_BASE_MODEL_DIR is the local path to the value model. Considering the different dependency versions of transformers, Mistral-7B is adopted as the backbone of the value model when the policy model is Llama3-8B-Instruct or MetaMath-Mistral-7B; when the policy model is SciGLM, we use ChatGLM3-6B as the backbone of the value model.
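To summarize the pairing described above as a small lookup (the string labels here are just illustrative, not identifiers used in the repository):

```python
# Illustrative summary of the policy-model / value-backbone pairing above.
# The keys and values are placeholder labels, not repo identifiers.
VALUE_BACKBONE_FOR_POLICY = {
    "Llama3-8B-Instruct": "Mistral-7B",
    "MetaMath-Mistral-7B": "Mistral-7B",
    "SciGLM": "ChatGLM3-6B",
}
```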

In addition, you can download $D_{V_0}$ and put it in PRM/data to train Mistral-7B as the initial process reward model and obtain VALUE_MODEL_STATE_DICT. We also provide PRM/train_VM_chatglm.py and PRM/train_VM_mistral.py for this purpose.
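For readers still unsure how the two paths relate, here is a minimal sketch (my own, not the repository's actual loading code) of the usual pattern: VALUE_BASE_MODEL_DIR points at the backbone checkpoint, and VALUE_MODEL_STATE_DICT at the fine-tuned weights produced by the PRM training script. The path values, dtype, and the `strict=False` choice are assumptions for illustration only.

```python
# Minimal sketch (not the repository's actual code): combining
# VALUE_BASE_MODEL_DIR (backbone checkpoint) with VALUE_MODEL_STATE_DICT
# (fine-tuned weights, e.g. from PRM/train_VM_mistral.py). Paths are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

VALUE_BASE_MODEL_DIR = "/path/to/Mistral-7B"        # local Hugging Face checkpoint of the backbone
VALUE_MODEL_STATE_DICT = "/path/to/value_model.pt"  # weights saved by the PRM training script

tokenizer = AutoTokenizer.from_pretrained(VALUE_BASE_MODEL_DIR)
value_model = AutoModel.from_pretrained(VALUE_BASE_MODEL_DIR, torch_dtype=torch.bfloat16)

# Overlay the fine-tuned process-reward weights on the backbone.
state_dict = torch.load(VALUE_MODEL_STATE_DICT, map_location="cpu")
value_model.load_state_dict(state_dict, strict=False)  # strict=False tolerates extra value-head keys
value_model.eval()
```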

PKUfreshman commented 1 month ago

Thanks!