Closed jt4n closed 1 month ago
Thank you for your interest in our work. You should invoke the AutoModelForCausalLMWithValueHead in trl for training to avoid this error.
You can refer to Section Model Loader in our implementation_details.md for more details. I have updated this file, and thanks for your good question.
Additionally, you should use bf16 in your training script instead of fp16 to avoid any potential errors.
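For readers following along, the advice above might look roughly like the sketch below. It assumes a trl version that exports AutoModelForCausalLMWithValueHead and a working torch install; the helper name load_value_model is hypothetical and is not taken from the project's implementation_details.md.

```python
def load_value_model(model_path: str):
    """Load a causal LM wrapped with a value head, with weights in bf16.

    Hypothetical helper for illustration; not the project's own loader.
    """
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed.
    import torch
    from trl import AutoModelForCausalLMWithValueHead

    # torch_dtype is forwarded to the underlying transformers loader.
    return AutoModelForCausalLMWithValueHead.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
    )
```

On the command line, the matching change is to pass --bf16 rather than --fp16 to the training script.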
Thanks a lot! I can start the training process by referring to the Model Loader section.
Hi, I have encountered another problem and would like to ask for your advice.
I tried to evaluate an intermediate checkpoint by running evaluation with configs/sbs_sft.yaml, but it failed with the error below:
Traceback (most recent call last):
  File "/home/workspace/Super_MARIO/solver_demo.py", line 47, in <module>
    solver = Solver(config=config)
  File "/home/workspace/Super_MARIO/mcts_math/solver.py", line 73, in __init__
    self.llm = self.create_llm()
  File "/home/workspace/Super_MARIO/mcts_math/solver.py", line 92, in create_llm
    engine, sampling_params = llm_engine(self.config)
  File "/home/workspace/Super_MARIO/mcts_math/llms/local_llm_engine.py", line 48, in llm_engine
    llm, sampling_params = llm_init(config)
  File "/home/workspace/Super_MARIO/mcts_math/llms/local_llm_engine.py", line 26, in llm_init
    llm = LLM(
  File "/home/workspace/vllm/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/workspace/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/workspace/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/home/workspace/vllm/vllm/executor/gpu_executor.py", line 35, in __init__
    self._init_worker()
  File "/home/workspace/vllm/vllm/executor/gpu_executor.py", line 63, in _init_worker
    self.driver_worker.load_model()
  File "/home/workspace/vllm/vllm/worker/worker.py", line 98, in load_model
    self.model_runner.load_model()
  File "/home/workspace/vllm/vllm/worker/model_runner.py", line 90, in load_model
    self.model = get_model(self.model_config,
  File "/home/workspace/vllm/vllm/model_executor/model_loader.py", line 88, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/home/workspace/vllm/vllm/model_executor/models/modeling_value_head.py", line 154, in load_weights
    self.pretrained_model.load_weights(model_name_or_path, cache_dir,
  File "/home/workspace/vllm/vllm/model_executor/models/llama.py", line 399, in load_weights
    param = params_dict[name]
KeyError: 'pretrained_model.lm_head.weight'
The intermediate training files look like this:
I can see that config.json, generation_config.json, model.safetensors.index.json, and value_head.pth are missing.
I tried copying these files from the original ckpt but still got the same error.
I can, however, successfully run evaluation on both the original ckpt (with the value head added) and the Round3 ckpt you released on Hugging Face.
The training script is:
export CUDA_VISIBLE_DEVICES=3,4,5,6
deepspeed --num_gpus 4 src/train_bash.py \
--deepspeed /home/workspace/LLaMA-Factory-0.6.1/LLaMA-Factory/examples/deepspeed/ds_z3_offload_config.json \
--ddp_timeout 180000000 \
--stage sft \
--do_train \
--model_name_or_path /home/workspace/ModelWeights/deepseek-math-7b-base-value_model \
--dataset alpha_math_round3 \
--template vanilla \
--finetuning_type full \
--output_dir /home/workspace/trained_models/alpha-math-7b-value_model-new \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 4e-5 \
--num_train_epochs 10.0 \
--plot_loss \
--bf16
Is there any detail I missed?
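The KeyError in the traceback names a weight key, pretrained_model.lm_head.weight, that the loader did not expect. One way to check which key names a checkpoint shard actually contains, without loading any weights, is to read the JSON header of the file. The sketch below assumes the shards are stored in the safetensors format (an 8-byte little-endian header length followed by a JSON header); the function names are hypothetical.

```python
import json
import struct
from typing import List


def list_safetensors_keys(path: str) -> List[str]:
    """Return the tensor names stored in a .safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(name for name in header if name != "__metadata__")


def keys_with_prefix(keys: List[str], prefix: str = "pretrained_model.") -> List[str]:
    """Return the keys that start with the given prefix."""
    return [name for name in keys if name.startswith(prefix)]
```

Calling keys_with_prefix(list_safetensors_keys(shard_path)) on each shard of the trained checkpoint shows whether any weight keys carry the "pretrained_model." prefix reported in the error.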
Thanks for your insightful question. Your code does not process the checkpoint properly.
I have updated implementation_details.md for this issue.
Please refer to Section Save checkpoint.
Feel free to ask if you have any further questions. 🥳
Thanks for your explanation.
The problem was caused by the value head being saved inside the overall weights while the script tried to load it separately. It can be fixed by changing the code, before training, so that value_head.pth is saved separately from each ckpt. I also found that it can be fixed by manually splitting the value head out of a trained ckpt (after training).
Another noteworthy point is that llama-factory-0.6.1 adds a "pretrained_model." prefix to the weight keys (on my machine, for an unknown reason), so I need to remove it to deploy the trained ckpt.
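The manual fix described above (splitting the value head out of a merged checkpoint and dropping the "pretrained_model." prefix) can be sketched on a plain key-to-tensor mapping. This is a minimal illustration, not the project's code: it assumes the value-head keys start with "v_head." as in trl's wrapper, and the function name split_value_head is hypothetical. Strings stand in for tensors so the example runs without torch.

```python
from typing import Any, Dict, Mapping, Tuple

WRAPPER_PREFIX = "pretrained_model."  # prefix reported in this thread
V_HEAD_PREFIX = "v_head."             # assumption: trl-style value-head keys


def split_value_head(
    state_dict: Mapping[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a merged state dict into (decoder weights, value-head weights)."""
    decoder: Dict[str, Any] = {}
    v_head: Dict[str, Any] = {}
    for name, tensor in state_dict.items():
        # Strip the wrapper prefix so the decoder keys match the base model.
        if name.startswith(WRAPPER_PREFIX):
            name = name[len(WRAPPER_PREFIX):]
        if name.startswith(V_HEAD_PREFIX):
            v_head[name] = tensor
        else:
            decoder[name] = tensor
    return decoder, v_head


# Toy example: strings stand in for real tensors.
merged = {
    "pretrained_model.lm_head.weight": "W_lm",
    "pretrained_model.model.embed_tokens.weight": "W_emb",
    "v_head.summary.weight": "W_v",
    "v_head.summary.bias": "b_v",
}
decoder, v_head = split_value_head(merged)
print(sorted(decoder))
print(sorted(v_head))
```

With a real checkpoint, the two resulting dicts would then be written back out, with the value head going to value_head.pth; the saving step is omitted here because it depends on how the checkpoint is stored.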
In my opinion, your code does not process the checkpoint properly. You can refer to Section Save checkpoint.
I'm working on it. Does it mean that ValueTrainer trains both the policy model and the value model (head) at the same time?
Yes. ValueTrainer is just the name I gave to the Trainer for Alphamath; the name itself doesn't matter.
You should add + [FixValueHeadModelCallback()] to callbacks and set V_HEAD_WEIGHTS_NAME = "value_head.pth" in ./llmtuner/extras/constants.py.
Thank you very much for your patient guidance. I've successfully reproduced the training stage on the Round3 data.
I got an accuracy of 0.65 after running SBS(B=3) inference on the MATH test set, which is very close to the results you published.
Hi, I'm reproducing your work.
When I use the round3_training_data.json data to SFT the deepseek-math-7b-base-value_model (after adding the value head), I get the error below:
I added the modified compute_loss function you provided on this page to the llama_factory CustomSeq2SeqTrainer class, to override the original compute_loss of the transformers Trainer class.
The training code is:
Did I make a mistake? How can I solve it?