MARIO-Math-Reasoning / Super_MARIO


Error occurred at SFT training #20

Closed · jt4n closed this issue 1 month ago

jt4n commented 1 month ago

Hi, I'm reproducing your job.

When I use the round3_training_data.json data to SFT the deepseek-math-7b-base-value_model (after adding the value head), I get the error below:

File "/home/workspace/LLaMA-Factory-0.6.1/LLaMA-Factory/src/llmtuner/train/sft/trainer.py", line 138, in compute_loss
    lm_logits, loss, values = model(**inputs, output_hidden_states=True, return_dict=True)
ValueError: not enough values to unpack (expected 3, got 2)

I added the modified compute_loss function you provided on this page to the LLaMA-Factory CustomSeq2SeqTrainer class, overriding the original compute_loss of the transformers Trainer class.

The training command is:

CUDA_VISIBLE_DEVICES=6 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /home/workspace/ModelWeights/deepseek-math-7b-base-value_model \
    --dataset alpha_math_round3 \
    --template vanilla \
    --finetuning_type full \
    --output_dir /home/workspace/trained_models/alpha-math-7b-value-model-new \
    --overwrite_cache \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 128 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 4e-5 \
    --num_train_epochs 10.0 \
    --plot_loss \
    --fp16

Did I make a mistake somewhere? How can I solve this?

Chen-GX commented 1 month ago

Thank you for your interest in our work. You should use AutoModelForCausalLMWithValueHead from trl for training to avoid this error. You can refer to the Model Loader section in our implementation_details.md for more details. I have updated this file; thanks for your good question.
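Roughly, the model loading looks like this (a minimal sketch only; see the Model Loader section in implementation_details.md for the exact code, and note the local path is just an example):

# Minimal sketch: wrap the causal LM with a value head via trl.
# The path below is an example; adapt it to your own checkpoint.
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model_path = "/home/workspace/ModelWeights/deepseek-math-7b-base-value_model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_path)

# With the value head attached, a forward pass returns (lm_logits, loss, values),
# which is what the overridden compute_loss expects to unpack.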

Chen-GX commented 1 month ago

Additionally, you should use bf16 in your training script instead of fp16 to avoid any potential errors.

jt4n commented 1 month ago

Thanks a lot! I was able to start the training process by following the Model Loader section.

jt4n commented 1 month ago

Hi, I have encountered another problem and would like to ask for your advice.

I tried to evaluate an intermediate checkpoint by running the script with configs/sbs_sft.yaml, but it fails with the error below:

Traceback (most recent call last):
  File "/home/workspace/Super_MARIO/solver_demo.py", line 47, in <module>
    solver = Solver(config=config)
  File "/home/workspace/Super_MARIO/mcts_math/solver.py", line 73, in __init__
    self.llm = self.create_llm()
  File "/home/workspace/Super_MARIO/mcts_math/solver.py", line 92, in create_llm
    engine, sampling_params = llm_engine(self.config)
  File "/home/workspace/Super_MARIO/mcts_math/llms/local_llm_engine.py", line 48, in llm_engine
    llm, sampling_params = llm_init(config)
  File "/home/workspace/Super_MARIO/mcts_math/llms/local_llm_engine.py", line 26, in llm_init
    llm = LLM(
  File "/home/workspace/vllm/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/workspace/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/workspace/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/home/workspace/vllm/vllm/executor/gpu_executor.py", line 35, in __init__
    self._init_worker()
  File "/home/workspace/vllm/vllm/executor/gpu_executor.py", line 63, in _init_worker
    self.driver_worker.load_model()
  File "/home/workspace/vllm/vllm/worker/worker.py", line 98, in load_model
    self.model_runner.load_model()
  File "/home/workspace/vllm/vllm/worker/model_runner.py", line 90, in load_model
    self.model = get_model(self.model_config,
  File "/home/workspace/vllm/vllm/model_executor/model_loader.py", line 88, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/home/workspace/vllm/vllm/model_executor/models/modeling_value_head.py", line 154, in load_weights
    self.pretrained_model.load_weights(model_name_or_path, cache_dir,
  File "/home/workspace/vllm/vllm/model_executor/models/llama.py", line 399, in load_weights
    param = params_dict[name]
KeyError: 'pretrained_model.lm_head.weight'

The intermediate checkpoint directory looks like the attached image: it lacks config.json, generation_config.json, model.safetensors.index.json, and value_head.pth. I tried copying these files from the original checkpoint, but I still get the same error.

I can successfully run evaluation on both the original checkpoint (with the value head added) and the Round 3 checkpoint you released on Hugging Face.

The training script is:

export CUDA_VISIBLE_DEVICES=3,4,5,6
deepspeed --num_gpus 4 src/train_bash.py \
    --deepspeed /home/workspace/LLaMA-Factory-0.6.1/LLaMA-Factory/examples/deepspeed/ds_z3_offload_config.json \
    --ddp_timeout 180000000 \
    --stage sft \
    --do_train \
    --model_name_or_path /home/workspace/ModelWeights/deepseek-math-7b-base-value_model \
    --dataset alpha_math_round3 \
    --template vanilla \
    --finetuning_type full \
    --output_dir /home/workspace/trained_models/alpha-math-7b-value_model-new \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 4e-5 \
    --num_train_epochs 10.0 \
    --plot_loss \
    --bf16

Is there any detail I missed?

Chen-GX commented 1 month ago

Thanks for your insightful question. Your code does not process the checkpoint properly. I have updated implementation_details.md for this issue; please refer to the Save checkpoint section.

Feel free to ask if you have any questions. 🥳

jt4n commented 1 month ago

Thanks for your explanation.

The problem is that the value head was saved inside the overall model weights, while the evaluation script tries to load it separately. It can be fixed either by configuring training to save value_head.pth separately for each checkpoint (before training), or by manually splitting the value head out of a trained checkpoint (after training).

Another noteworthy detail: LLaMA-Factory 0.6.1 adds a "pretrained_model." prefix to the weight keys (on my machine, for an unknown reason), and I had to remove it before deploying the trained checkpoint.
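For anyone hitting the same issue, this is roughly the post-processing I described above (a sketch only; the paths and key names depend on your checkpoint layout, and you need to loop over every shard if the weights are split across multiple safetensors files):

# Strip the "pretrained_model." prefix from the weight keys and split the
# value head weights into a separate value_head.pth.
import torch
from safetensors.torch import load_file, save_file

ckpt = "checkpoint-1000"  # example checkpoint directory
state_dict = load_file(f"{ckpt}/model.safetensors")

base_weights, value_head = {}, {}
for key, tensor in state_dict.items():
    key = key.removeprefix("pretrained_model.")  # drop the LLaMA-Factory prefix
    if key.startswith("v_head."):
        value_head[key] = tensor                 # belongs in value_head.pth
    else:
        base_weights[key] = tensor               # stays in the main weights

save_file(base_weights, f"{ckpt}/model.safetensors")
torch.save(value_head, f"{ckpt}/value_head.pth")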

Chen-GX commented 1 month ago

In my opinion, your code does not process the checkpoint properly. You can refer to the Save checkpoint section.

jt4n commented 1 month ago

I'm working on it. Does that mean the ValueTrainer trains both the policy model and the value model (head) at the same time?

Chen-GX commented 1 month ago

Yes. ValueTrainer is just my name for the Trainer in AlphaMath; the name itself doesn't matter. You should add + [FixValueHeadModelCallback()] to the callbacks and set V_HEAD_WEIGHTS_NAME = "value_head.pth" in ./llmtuner/extras/constants.py.
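Roughly, the two changes look like this (a sketch; the exact file locations and import path follow the LLaMA-Factory 0.6.x layout and may differ in your version):

# 1) src/llmtuner/extras/constants.py
#    Save/load the value head under the file name the inference code expects.
V_HEAD_WEIGHTS_NAME = "value_head.pth"

# 2) In the SFT workflow, register the callback when building the trainer so the
#    value head is written out separately with every saved checkpoint.
from llmtuner.extras.callbacks import FixValueHeadModelCallback

callbacks = callbacks + [FixValueHeadModelCallback()]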

jt4n commented 1 month ago

Thank you very much for your patient guidance. I've reproduced the training stage on Round3 data successfully.

I got an accuracy of 0.65 after running SBS (B=3) inference on the MATH test set, which is very close to the result you published.