bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!

Usage of LoadBestPeftModelCallback in Finetuning stage #136

Open ttssp opened 1 year ago

ttssp commented 1 year ago

Hi friends,

I was trying to test the finetune/finetune.py script. It seems that state.best_model_checkpoint always returns None, which causes a failure at the end of the program. Is it that the program never saved a "best model" during training? I am a bit new to this; could anyone explain what is happening and offer some hints on fixing it? Thanks a lot!
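For context, LoadBestPeftModelCallback fires at the end of training and loads the best PEFT adapter from state.best_model_checkpoint. A rough sketch of that pattern (paraphrased, not the exact finetune.py source):

```python
import os

import torch
from transformers import TrainerCallback
from peft import set_peft_model_state_dict


class LoadBestPeftModelCallback(TrainerCallback):
    def on_train_end(self, args, state, control, **kwargs):
        # state.best_model_checkpoint is only populated if the Trainer saved
        # and evaluated at least one checkpoint during training; otherwise it
        # stays None and os.path.join() below fails with a TypeError.
        print(f"Loading best peft model from {state.best_model_checkpoint}")
        best_model_path = os.path.join(state.best_model_checkpoint, "adapter_model.bin")
        adapters_weights = torch.load(best_model_path)
        set_peft_model_state_dict(kwargs["model"], adapters_weights)
        return control
```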

command (single GPU):

python finetune/finetune.py --model_path="../../models/starcoder/" --dataset_name="../../datasets/ArmelR/stack-exchange-instruction" --subset="data/finetune" --split="train" --size_valid_set 10 --streaming --seq_length 2048 --max_steps 2 --batch_size 1 --input_column_name="question" --output_column_name="response" --gradient_accumulation_steps 1 --learning_rate 1e-4 --lr_scheduler_type="cosine" --num_warmup_steps 1 --weight_decay 0.05 --output_dir="./checkpoints"

error (single GPU):

[screenshot of the traceback; not reproduced here]

command (multi-GPU):

python -m torch.distributed.launch --nproc_per_node 4 finetune/finetune.py --model_path="../../models/starcoder/" --dataset_name="../../datasets/ArmelR/stack-exchange-instruction" --subset="data/finetune" --split="train" --size_valid_set 10000 --streaming --seq_length 2048 --max_steps 2 --batch_size 1 --input_column_name="question" --output_column_name="response" --gradient_accumulation_steps 16 --learning_rate 1e-4 --lr_scheduler_type="cosine" --num_warmup_steps 100 --weight_decay 0.05 --output_dir="./checkpoints"

error (multi-GPU):

[screenshot of the traceback; not reproduced here]
upjabir commented 1 year ago

@ttssp I believe save_steps defaults to 100, and you are running fine-tuning for only 2 steps, so no checkpoint is ever saved and state.best_model_checkpoint stays None. Try reducing save_steps to 1 or 2.
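For what it's worth, the Trainer only records state.best_model_checkpoint if it both saves a checkpoint and evaluates it before training ends. A minimal sketch of the settings involved, using the standard transformers TrainingArguments names (not necessarily the exact finetune.py CLI flags):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    max_steps=2,                    # tiny smoke-test run, as in the issue
    evaluation_strategy="steps",
    eval_steps=1,                   # evaluate at least once before training ends
    save_strategy="steps",
    save_steps=1,                   # save at least once: save_steps <= max_steps
    load_best_model_at_end=True,
    metric_for_best_model="loss",   # Trainer tracks the best checkpoint by this metric
)
```

With the issue's settings (max_steps=2 and a save interval of 100), no checkpoint is ever written, so state.best_model_checkpoint stays None and the callback crashes in on_train_end. A None guard in the callback would also let short smoke tests finish cleanly.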